How Isambard works

This page explains the key concepts behind running work on Isambard. If you are used to running code on a laptop, workstation, or cloud platform, a few things work differently on the BriCS Isambard platforms.

Jobs, not programs

On a laptop or cloud instance, you start a program and it runs immediately. On Isambard, all computation goes through a job scheduler called Slurm. Slurm manages a shared queue of requests from all users and allocates compute nodes as resources become available.

Unlike a cloud virtual machine, a compute node on Isambard is not started by you and does not persist between jobs. Resources are allocated when Slurm starts your job and released when it finishes.

There are three main ways to run work.

Batch jobs (sbatch) are the most common approach. You write a script describing what to run and what resources you need, then submit it with sbatch. Slurm queues the job and runs it when resources are free, writing output to a file. You do not need to stay logged in while it runs, making this well-suited to long-running or repetitive workloads.
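As a rough illustration of what such a script contains, the sketch below builds a minimal batch script as a string. The `#SBATCH` options shown (`--job-name`, `--gpus`, `--time`, `--output`) are standard Slurm options, but the helper name `render_batch_script`, the example values, and the choice of options are assumptions made for illustration. In particular, `--gpus` would not apply to the CPU-only Isambard 3 Grace nodes. The Slurm Job Management guide remains the reference for the options recommended on each system.

```python
def render_batch_script(job_name: str, gpus: int, walltime: str, command: str) -> str:
    """Return the text of a minimal Slurm batch script (illustrative only)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --gpus={gpus}",      # assumed resource request; adjust per system
        f"#SBATCH --time={walltime}",  # requested walltime, e.g. HH:MM:SS
        "#SBATCH --output=%x-%j.out",  # %x expands to the job name, %j to the job ID
        "",
        command,                       # the work to run on the compute node
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: one GPU for 30 minutes running a placeholder command.
example_script = render_batch_script("example", 1, "00:30:00", "python train.py")
```

Writing the returned text to a file such as `job.sh` and running `sbatch job.sh` would place the job in the queue.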

Interactive sessions (srun) give you a shell directly on a compute node. Commands run immediately and you see output as it appears, just as you would on a desktop. This is useful for development, debugging, and exploratory work where you need to react to results in real time.

JupyterHub provides a browser-based notebook environment on Isambard-AI Phase 2. It handles job submission automatically — you configure your session through a web form and JupyterHub requests the resources from Slurm on your behalf. This is a good starting point if you prefer working in notebooks without using the command line.

All three methods go through Slurm, so all are subject to queuing. If the resources you need are fully occupied by other users' jobs, your request waits in the queue until they become available. For batch jobs this happens silently — you can log out and the job will start when its turn comes. For interactive sessions and JupyterHub, your terminal or browser will wait until a node is allocated before giving you access.

The login nodes are the main access point when working from the command line. You reach them via SSH and use them for light tasks such as editing files, submitting and monitoring Slurm jobs, and checking results. Long-running or compute-intensive tasks are not permitted on the login nodes.

Login nodes are shared

Login nodes are shared and resource-limited and must not be used for running intensive workloads. Anything computationally heavy must go through Slurm as a job.

Login nodes may have different hardware

The login nodes may have different hardware to the compute nodes. Build tools that auto-detect hardware tune for the machine they run on, so software compiled or built on the login nodes may not be optimal on the compute nodes.

Compute nodes

The compute nodes are the machines where your jobs actually run. Unlike a cloud platform, there is no catalogue of instance types to choose from. The hardware is fixed per system: you request a number of GPUs or CPU cores and Slurm allocates the corresponding share of a node. Each system has a different hardware configuration, which affects what software will work and how you request resources.

Isambard-AI

Each Isambard-AI compute node contains 4 NVIDIA GH200 Grace Hopper Superchips. Each superchip pairs one 72-core Grace CPU with one H100 GPU, connected by a high-bandwidth NVLink-C2C link. A single node therefore provides:

4 GPUs (H100 Tensor Core, 96 GB GPU memory each)
288 CPU cores (4 × 72-core Grace)
~460 GB usable CPU memory plus 384 GB GPU memory

The minimum allocatable unit is 1 GPU (one quarter of a node). Even if you only need CPU cores, you will be allocated at least one GPU's worth of resources.
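To make the quarter-node unit concrete, the sketch below divides the per-node figures listed above evenly across the 4 GPUs. The even split and the names `NodeSpec` and `share_of_node` are assumptions for illustration; the exact amounts Slurm assigns per GPU may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    gpus: int
    cpu_cores: int
    cpu_memory_gb: float


# Per-node figures for Isambard-AI as listed on this page (CPU memory is approximate).
ISAMBARD_AI_NODE = NodeSpec(gpus=4, cpu_cores=288, cpu_memory_gb=460)


def share_of_node(spec: NodeSpec, gpus_requested: int) -> NodeSpec:
    """Resources for a GPU request, ASSUMING an even split of the node per GPU."""
    if not 1 <= gpus_requested <= spec.gpus:
        raise ValueError(f"request between 1 and {spec.gpus} GPUs")
    fraction = gpus_requested / spec.gpus
    return NodeSpec(
        gpus=gpus_requested,
        cpu_cores=int(spec.cpu_cores * fraction),
        cpu_memory_gb=spec.cpu_memory_gb * fraction,
    )


one_gpu = share_of_node(ISAMBARD_AI_NODE, 1)  # 72 cores under the even-split assumption
```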

Grace CPUs use the ARM (aarch64) architecture. Code compiled for x86_64, including Conda environments created on an x86 machine and x86-only Docker images pulled from public registries, will not work. See the Containers guide and Python guide for guidance on building compatible software.

Isambard 3

Each Isambard 3 Grace compute node contains 1 NVIDIA Grace CPU Superchip with 2 × 72-core Grace CPUs and 240 GB of memory. There are no GPUs on the standard Grace nodes.

The minimum allocatable unit is 1 CPU core (1/144th of a node).

Grace CPUs use the ARM (aarch64) architecture. Code compiled for x86_64 — including most Docker images from public registries — will not work on the Grace nodes. Isambard 3 also includes the Multi-Architecture Comparison System (MACS), which provides various x86_64 nodes — see System Specifications for details.

Docker is not available on Isambard

Use Apptainer instead. Apptainer can convert Docker images from public registries and run them without root privileges, making it well-suited to shared HPC systems.

Compiled languages such as C, C++, and Fortran require aarch64 binaries; x86_64 binaries that you may have from other systems will not run. We recommend compiling your code on our system to ensure that your targets are built against compatible system libraries (CUDA, MPI, glibc, etc.). Compilers and optimised libraries for the ARM hardware are available via the modules system. For more complex software stacks, Spack provides a package manager for building and managing compiled research software.
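A quick way to confirm which architecture a Python environment is running on is to inspect `platform.machine()` from the standard library, which reports `aarch64` on 64-bit ARM Linux. The sketch below wraps that check; the helper name `is_arm64` is hypothetical, and `arm64` is included only because some other operating systems use that spelling.

```python
import platform
from typing import Optional

# Spellings used for 64-bit ARM: "aarch64" on Linux, "arm64" on some other systems.
ARM64_NAMES = {"aarch64", "arm64"}


def is_arm64(machine: Optional[str] = None) -> bool:
    """Return True if the machine string names a 64-bit ARM architecture.

    With no argument, checks the machine the interpreter is running on.
    """
    if machine is None:
        machine = platform.machine()
    return machine.lower() in ARM64_NAMES
```

Running `is_arm64()` on a laptop and again inside a job shows whether software built in one place can be expected to run in the other.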

Storage

Isambard provides a POSIX parallel filesystem — not cloud object storage such as Amazon S3 or Google Cloud Storage. All data is accessed as files and directories using standard filesystem operations, but storage is divided into several named areas, each with a different purpose, quota, and lifetime.

Storage area	Purpose	Key limits
`$HOME`	Config files, scripts, small outputs	100 GiB. If this fills up, you may not be able to log in via SSH.
`$SCRATCHDIR`	Working data, checkpoints, intermediate results	5 TiB per user. On Isambard 3, files not accessed for 60 days are deleted.
`$PROJECTDIR`	Shared datasets and environments within your project	20 TiB (Isambard 3) or 200 TiB (Isambard-AI) per project.

No storage on Isambard is backed up

All storage on Isambard is working storage only. You are responsible for copying important data to your own external storage before your project ends.

Public directory in $PROJECTDIR

The public directory in your $PROJECTDIR, also reachable as $PROJECTDIR_PUBLIC, is visible to all users from all projects. Do not use it for sensitive or restrictively licensed data or software.

See Storage Spaces for commands to check your current usage against these limits.
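Because scratch files on Isambard 3 are deleted after 60 days without access, it can help to list files that are approaching that age. The following is a minimal sketch using only the Python standard library; the helper name `stale_files` is hypothetical, and it assumes the access time reported by the filesystem matches what the deletion policy uses, which this page does not confirm.

```python
import os
import time
from pathlib import Path
from typing import Iterator, Optional

SECONDS_PER_DAY = 86_400


def stale_files(root: str, days: int = 60, now: Optional[float] = None) -> Iterator[Path]:
    """Yield files under `root` whose last access time is more than `days` days ago."""
    if now is None:
        now = time.time()
    cutoff = now - days * SECONDS_PER_DAY
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                atime = path.stat().st_atime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if atime < cutoff:
                yield path
```

For example, `list(stale_files(os.environ["SCRATCHDIR"], days=45))` would list files not accessed for 45 days, giving some margin to copy them elsewhere.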

Project budget

Rather than pay-as-you-go cloud billing, Isambard uses a pre-allocated budget model. Every project is allocated a number of node hours (NHR) upfront. One NHR represents one full node used for one hour. If your job uses less than a full node, you are charged proportionally — for example, using 1 GPU on Isambard-AI (one quarter of a node) for one hour costs 0.25 NHR.
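The proportional charge described above can be sketched as a one-line calculation. The function name `node_hours` is hypothetical, and the sketch ignores any rounding or minimum-charge rules that the Accounting guide may define.

```python
def node_hours(units_used: int, units_per_node: int, hours: float) -> float:
    """Node hours (NHR) for a share of a node, assuming purely proportional charging.

    `units_per_node` is 4 (GPUs) on Isambard-AI and 144 (CPU cores) on Isambard 3 Grace nodes.
    """
    if units_used <= 0 or units_per_node <= 0 or hours < 0:
        raise ValueError("units must be positive and hours non-negative")
    return units_used / units_per_node * hours


# The example from the text: 1 GPU on Isambard-AI for one hour.
one_gpu_one_hour = node_hours(1, 4, 1.0)    # 0.25 NHR

# 36 of 144 cores on an Isambard 3 Grace node for two hours.
quarter_node_two_hours = node_hours(36, 144, 2.0)  # 0.5 NHR
```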

You can check your project's remaining balance and per-member usage in the BriCS Portal.

A few things to bear in mind:

Jobs will not start if your project does not have enough NHR to cover the requested walltime.
Running jobs are never killed when a project runs out of budget, but no new jobs will start until more node hours are available.
Node hours that are not used by the end of your project are lost — there is no way to carry them beyond the project end date.
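Since the check is made against the requested walltime rather than the time a job actually uses, a rough pre-submission estimate can be useful. The sketch below is an assumption-laden illustration, not the scheduler's own logic: the names `walltime_hours` and `fits_budget` are hypothetical, and only the `HH:MM:SS` and `D-HH:MM:SS` walltime formats are handled.

```python
def walltime_hours(walltime: str) -> float:
    """Convert a walltime given as 'HH:MM:SS' or 'D-HH:MM:SS' to hours."""
    days = 0
    if "-" in walltime:
        day_part, walltime = walltime.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return days * 24 + hours + minutes / 60 + seconds / 3600


def fits_budget(balance_nhr: float, node_fraction: float, walltime: str) -> bool:
    """True if the balance covers the full requested walltime at the given node share."""
    requested_nhr = node_fraction * walltime_hours(walltime)
    return balance_nhr >= requested_nhr
```

For example, `fits_budget(10.0, 0.25, "24:00:00")` asks whether 10 NHR covers one quarter of a node for 24 hours (6 NHR), so trimming an over-generous walltime request can let a job start on a nearly exhausted budget.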

For full details on how usage is calculated, see the Accounting guide.

Where to go next

The documentation is organised into three types:

Tutorials walk you through key tasks step-by-step to build confidence — start here if you are new to the system.
Guides address specific tasks and assume you know what you want to achieve.
Information pages like this one explain system concepts and behaviour.

Get started:

Setup — Complete your first login through the portal. (Tutorial)
Log in — Connect via SSH. (Guide)

Running work:

Slurm Job Management — Submit, monitor, and manage jobs. (Guide)
JupyterHub — Browser-based notebooks on Isambard-AI Phase 2. (Guide)

Software:

Modules and Compilers — Access compilers and optimised libraries. (Guide)
Python — Install and manage Python environments. (Guide)
Containers — Run software with Apptainer. (Guide)
Spack — Build complex compiled software stacks. (Guide)

Storage and accounting:

Storage Spaces — Full details on quotas, paths, and permissions. (Information)
Accounting — Understand how node hours are calculated and tracked. (Guide)