Slurm Job Management

Isambard-AI and Isambard 3 use the Slurm Workload Manager to run jobs on the compute nodes.

Isambard-AI

SBATCH: Writing job submission scripts

You can run a job by submitting a batch script with the sbatch command. For example, if your batch script is called myscript.sh, you can submit it as follows:

user.project@nid001040:~> sbatch myscript.sh
Submitted batch job 19159

This will add your job to the queue for the compute nodes, and your job will run when the requested resource (as defined in the batch script) is available.

Specify GPU Resource

You must specify GPU resource in your batch script using either --gpus or one of the --gpus-per-* options. Each GPU requested will also allocate 72 CPU cores and 115 GB of Grace RAM, i.e. one Grace Hopper Superchip.
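
As a rough illustration of the arithmetic above, the following sketch scales the per-GPU figures to a request for several GPUs. gpu_allocation is a hypothetical helper written for this example, not a Slurm command, and it assumes the per-GPU figures quoted above apply to every GPU requested.

```shell
#!/bin/bash
# Per-GPU figures quoted above (one Grace Hopper Superchip per GPU).
CORES_PER_GPU=72
RAM_GB_PER_GPU=115

# gpu_allocation is a hypothetical helper, not a Slurm command.
# Usage: gpu_allocation <number_of_gpus>
gpu_allocation() {
    local gpus="$1"
    echo "gpus=${gpus} cores=$((gpus * CORES_PER_GPU)) ram_gb=$((gpus * RAM_GB_PER_GPU))"
}

gpu_allocation 1   # prints: gpus=1 cores=72 ram_gb=115
gpu_allocation 4   # prints: gpus=4 cores=288 ram_gb=460
```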

When composing your batch script, please consider the resources available, e.g. you may wish to use srun to run multiple jobs simultaneously across a Grace Hopper Superchip or a node. You should also use the --time directive to set a time limit for your job.

See the examples below for guidance on how to write your batch script.

Running a single GPU job

The following example batch script requests one GPU and runs the commands hostname and nvidia-smi sequentially. For a single GPU, 72 CPU cores and 115 GB of Grace RAM will be allocated (one Grace Hopper Superchip).

#!/bin/bash

#SBATCH --job-name=docs_ex1
#SBATCH --output=docs_ex1.out
#SBATCH --gpus=1
#SBATCH --time=00:5:00         # Hours:Mins:Secs

hostname
nvidia-smi --list-gpus

If this file is named docs1.batch, you can submit the job as follows:

user.project@nid001040:~> sbatch docs1.batch
Submitted batch job 19159

Checking the output of the job:

user.project@nid001040:~> cat docs_ex1.out 
nid001038
GPU 0: GH200 120GB (UUID: GPU-f9fc6950-574a-5310-dc7c-b8d02cc529db)

The output shows the hostname and the GPU information for the compute node on which the job ran.

Using srun you can also run a single GPU job multiple times in parallel, e.g.

#!/bin/bash

#SBATCH --job-name=docs_ex2
#SBATCH --output=docs_ex2.out
#SBATCH --gpus=2
#SBATCH --ntasks-per-gpu=1
#SBATCH --time=00:5:00

srun nvidia-smi --list-gpus

If we run this batch script and check the output:

user.project@nid001040:~> sbatch docs2.batch 
Submitted batch job 19161
user.project@nid001040:~> cat docs_ex2.out 
GPU 0: GH200 120GB (UUID: GPU-f9fc6950-574a-5310-dc7c-b8d02cc529db)
GPU 0: GH200 120GB (UUID: GPU-4833ca67-f003-3dbd-1d44-a5c7645a5ae3)

Checking the UUIDs we can see 2 different GPUs have been allocated.

Specifying different job steps with srun

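You can run different job steps concurrently by launching each one in the background with & and adding the wait command at the end, so that the batch script does not exit (ending the job) before the steps have finished, e.g.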

srun --ntasks=1 --gpus=1 --exclusive job_step1 &
srun --ntasks=1 --gpus=1 --exclusive job_step2 &
wait

In a job where 2 or more GPUs have been allocated, this will run the job steps concurrently, allocating 1 GPU to each job step and running 1 task per job step. The srun --exclusive flag here ensures that the job steps are only allocated as much resource as requested in the srun command and that they can run concurrently.
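
A plain wait with no arguments returns 0 whatever the background steps did, so a failed step can go unnoticed. The sketch below shows one way to record each step's exit status by waiting on its process ID. So that it runs without Slurm, job_step1 and job_step2 are hypothetical stand-in functions; in a real batch script each would be an srun command like those above.

```shell
#!/bin/bash
# Hypothetical stand-ins for "srun --ntasks=1 --gpus=1 --exclusive job_stepN".
job_step1() { sleep 1; return 0; }
job_step2() { sleep 1; return 3; }

# Launch both steps in the background and remember their process IDs.
job_step1 &
pid1=$!
job_step2 &
pid2=$!

# "wait <pid>" blocks until that step has ended and returns its exit status.
status1=0; wait "$pid1" || status1=$?
status2=0; wait "$pid2" || status2=$?

echo "job_step1 exited with ${status1}"
echo "job_step2 exited with ${status2}"
```

With these stand-ins the script reports exit status 0 for job_step1 and 3 for job_step2.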

Running a Python script on single CPUs simultaneously

Consider the following Python script:

#!/usr/bin/env python3

import os
from time import sleep
from datetime import datetime
import socket

sleep(30)

time_now = datetime.now().strftime("%H:%M:%S")

print('Task {}: Hello world from {} at {}.'.format(os.environ["SLURM_PROCID"], socket.gethostname(), time_now))

This script sleeps for 30 seconds, then prints its Slurm task ID, the hostname and the time. If this script is called pysrun.py, then we can write the following batch script, which uses srun to run pysrun.py as three tasks:

#!/bin/bash

#SBATCH --job-name=docs_ex3
#SBATCH --output=docs_ex3.out
#SBATCH --gpus=1                # this allocates 72 CPU cores
#SBATCH --ntasks-per-gpu=3
#SBATCH --time=00:5:00

module load cray-python

srun python3 pysrun.py

If we run this batch script and check the output:

user.project@nid001040:~> sbatch docs3.batch 
Submitted batch job 19162
user.project@nid001040:~> cat docs_ex3.out 
Task 0: Hello world from nid001016 at 17:20:29.
Task 1: Hello world from nid001016 at 17:20:29.
Task 2: Hello world from nid001016 at 17:20:29.

We can see from the timestamps that the three tasks launched by srun executed simultaneously.
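
The task ID can also be used to give each task its own share of the work. The sketch below is one possible approach, not a site recommendation: it reads SLURM_PROCID and SLURM_NTASKS (both set by srun for each task) and selects items round-robin. my_items and the input_* names are hypothetical, and the defaults let the script run outside Slurm as a single task.

```shell
#!/bin/bash
# SLURM_PROCID and SLURM_NTASKS are set by srun for each task.
# The defaults below are an assumption so the sketch also runs outside Slurm.
task_id=${SLURM_PROCID:-0}
num_tasks=${SLURM_NTASKS:-1}

# my_items is a hypothetical helper: it prints the arguments whose position
# (counting from 0) maps to this task when shared out round-robin.
my_items() {
    local i=0
    local item
    for item in "$@"; do
        if [ $((i % num_tasks)) -eq "$task_id" ]; then
            echo "$item"
        fi
        i=$((i + 1))
    done
}

my_items input_a input_b input_c input_d input_e
```

Launched with srun as three tasks, task 0 would select input_a and input_d, task 1 input_b and input_e, and task 2 input_c.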

SRUN & SALLOC: Submitting interactive jobs

As well as being used in job scripts to launch parallel tasks, srun can be used to submit a job interactively on the command line. This can be used to run a specific command using compute node resources, e.g.

user.project@nid001040:~> srun --gpus=1 --time=00:02:00 nvidia-smi --list-gpus
srun: job 19164 queued and waiting for resources
srun: job 19164 has been allocated resources

GPU 0: GH200 120GB (UUID: GPU-4833ca67-f003-3dbd-1d44-a5c7645a5ae3)

srun can also be used to start an interactive shell session on a compute node using the --pty option, e.g.

user.project@nid001040:~> srun --gpus=1 --time=00:15:00 --pty /bin/bash --login
srun: job 22874 queued and waiting for resources
srun: job 22874 has been allocated resources
user.project@nid001005:~> hostname
nid001005
user.project@nid001005:~> nvidia-smi --list-gpus
GPU 0: GH200 120GB (UUID: GPU-f00d9a03-840c-5ea2-a748-243383b6efbc)

Used in this way, srun creates a resource allocation before running the specified command or shell in the allocation.

It is also possible to use srun to start an interactive shell session associated with the resources allocated to an already-running job, which can be useful for monitoring and debugging jobs. This is done by specifying the job ID of a currently running job using the --jobid option, e.g.

## Submit batch job
user.project@nid001040:~> sbatch script.batch 
Submitted batch job 22886

## Check job has started running
user.project@nid001040:~> squeue --me
  JOBID         USER PARTITION                     NAME ST TIME_LIMIT       TIME  TIME_LEFT NODES NODELIST(REASON)
  22886 user.project     workq                   script  R      10:00       0:03       9:57     1 nid001005

## Start an interactive shell in a job step using the job's allocated resources
user.project@nid001040:~> srun --ntasks=1 --gpus=1 --jobid=22886 --pty /bin/bash -l

## Run interactive commands in the shell, then exit
user.project@nid001005:~> hostname
nid001005
user.project@nid001005:~> nvidia-smi --list-gpus
GPU 0: GH200 120GB (UUID: GPU-f00d9a03-840c-5ea2-a748-243383b6efbc)
user.project@nid001005:~> exit
logout

## After exiting the interactive shell, the original job continues running
user.project@nid001040:~> squeue --me
  JOBID         USER PARTITION                     NAME ST TIME_LIMIT       TIME  TIME_LEFT NODES NODELIST(REASON)
  22886 user.project     workq                   script  R      10:00       0:39       9:21     1 nid001005

Similarly, salloc can be used to reserve compute node resources, and then srun can be used to run jobs on the requested resource interactively:

user.project@nid001040:~> salloc --gpus=1 --time=00:2:00
salloc: Granted job allocation 19130
user.project@nid001040:~> srun hostname
nid001016
user.project@nid001040:~> srun nvidia-smi --list-gpus
GPU 0: GH200 120GB (UUID: GPU-4833ca67-f003-3dbd-1d44-a5c7645a5ae3)

Fair Use of Resources

Always set a time limit when using salloc to run jobs interactively, and cancel the job using scancel <JOB_ID> when you have finished (replace <JOB_ID> with the job ID of the allocation).

SQUEUE, SACCT, SCANCEL: Managing jobs

The squeue command shows the jobs queued and running on the system. Combine it with the --me flag to see just your own jobs:

user.project@nid001040:~> squeue --me
  JOBID         USER PARTITION                     NAME ST TIME_LIMIT       TIME  TIME_LEFT NODES NODELIST(REASON)
  19132 user.project     workq              interactive  R       2:00       0:53       1:07     1 nid001038

The sacct command shows your current and previously completed jobs:

user.project@nid001040:~> sacct
JobID           JobName  Partition      Account  AllocCPUS        State ExitCode 
------------ ---------- ---------- ------------ ---------- ------------ -------- 
19107            nvtest      workq      project        144    COMPLETED      0:0 
19107.batch       batch                 project        144    COMPLETED      0:0 
19107.0      nvidia-smi                 project        144       FAILED      2:0 
19114            pysrun      workq      project         72    COMPLETED      0:0 
19114.batch       batch                 project         72    COMPLETED      0:0 
19114.0         python3                 project          1    COMPLETED      0:0 
19114.1         python3                 project          1    COMPLETED      0:0 
19114.2         python3                 project          1    COMPLETED      0:0
19130        interacti+      workq      project         72    CANCELLED      0:0 
19132           pysrun2      workq      project         72      RUNNING      0:0 
19132.batch       batch                 project         72      RUNNING      0:0 
19132.0         python3                 project          1      RUNNING      0:0 
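
When the listing is long, failed job steps can be picked out with a small filter. The sketch below assumes the default column layout shown above, in which State is the second-to-last column; failed_steps is a hypothetical helper, not part of Slurm, and the sample rows are copied from the output above.

```shell
#!/bin/bash
# failed_steps is a hypothetical helper, not part of Slurm.
# It reads default-format sacct output on stdin, skips the two header lines,
# and prints the JobID of every row whose State column is FAILED.
failed_steps() {
    awk 'NR > 2 && $(NF - 1) == "FAILED" { print $1 }'
}

# Sample rows copied from the sacct output above.
failed_steps <<'EOF'
JobID           JobName  Partition      Account  AllocCPUS        State ExitCode
------------ ---------- ---------- ------------ ---------- ------------ --------
19107            nvtest      workq      project        144    COMPLETED      0:0
19107.batch       batch                 project        144    COMPLETED      0:0
19107.0      nvidia-smi                 project        144       FAILED      2:0
19114            pysrun      workq      project         72    COMPLETED      0:0
EOF
```

For the sample rows this prints 19107.0. On a login node the same filter could be applied to live output with sacct | failed_steps.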

The scancel command cancels a job. For example, to cancel the job with ID 19132:

user.project@nid001040:~> scancel 19132

Isambard 3

Isambard 3 consists of two clusters: Grace CPU Superchip and the Multi-Architecture Comparison System (MACS).

SBATCH: Writing job submission scripts

You can run a job by submitting a batch script with the sbatch command. For example, if your batch script is called myscript.sh, you can submit it as follows:

$ sbatch myscript.sh
Submitted batch job 19159

This will add your job to the queue for the compute nodes, and your job will run when the requested resource (as defined in the batch script) is available.

When composing your batch script, please consider the resources available. You should also use the --time directive to set a time limit for your job.

See the examples below for guidance on how to write your batch script.

Running a single job

The following example batch script runs the commands hostname and numactl -s sequentially. For a single node, 144 CPU cores and 200 GB of Grace RAM will be allocated (one Grace CPU Superchip).

#!/bin/bash

#SBATCH --job-name=docs_ex1
#SBATCH --output=docs_ex1.out
#SBATCH --time=00:5:00         # Hours:Mins:Secs
hostname
numactl -s

If this file is named docs1.batch, you can submit the job as follows:

$ sbatch docs1.batch
Submitted batch job 19159

Checking the output of the job:

$ cat docs_ex1.out 
x3003c0s31b2n0
policy: default
preferred node: current
physcpubind: 0 
cpubind: 0 
nodebind: 0 
membind: 0 1 

The output shows the hostname and the CPUs made available on the compute node on which the job ran. Note that only 1 CPU (or core) is available.

Using srun you can also run a single job multiple times in parallel, e.g.

#!/bin/bash

#SBATCH --job-name=docs_ex2
#SBATCH --output=docs_ex2.out
#SBATCH --time=00:5:00
#SBATCH --ntasks=2
srun numactl -s

If we run this batch script and check the output:

$ sbatch docs2.batch 
Submitted batch job 19161
$ cat docs_ex2.out 
policy: default
preferred node: current
physcpubind: 0 
cpubind: 0 
nodebind: 0 
membind: 0 1 
policy: default
preferred node: current
physcpubind: 72 
cpubind: 1 
nodebind: 1 
membind: 0 1 

Checking the physcpubind we can see 2 different CPUs (0 and 72) have been allocated.

Specifying different job steps with srun

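You can run different job steps concurrently by launching each one in the background with & and adding the wait command at the end, so that the batch script does not exit (ending the job) before the steps have finished, e.g.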

srun --ntasks=1 --exclusive job_step1 &
srun --ntasks=1 --exclusive job_step2 &
wait

In a job where 2 or more tasks have been allocated, this will run the job steps concurrently, running 1 task per job step. The srun --exclusive flag here ensures that the job steps are only allocated as much resource as requested in the srun command and that they can run concurrently.

Running a Python script on single CPUs simultaneously

Consider the following Python script:

#!/usr/bin/env python3

import os
from time import sleep
from datetime import datetime
import socket

sleep(30)

time_now = datetime.now().strftime("%H:%M:%S")

print('Task {}: Hello world from {} at {}.'.format(os.environ["SLURM_PROCID"], socket.gethostname(), time_now))

This script sleeps for 30 seconds, then prints its Slurm task ID, the hostname and the time. If this script is called pysrun.py, then we can write the following batch script, which uses srun to run pysrun.py as three tasks:

#!/bin/bash

#SBATCH --job-name=docs_ex3
#SBATCH --output=docs_ex3.out
#SBATCH --ntasks=3
#SBATCH --time=00:5:00

module load cray-python

srun python3 pysrun.py

If we run this batch script and check the output:

$ sbatch docs3.batch 
Submitted batch job 19162
$ cat docs_ex3.out 
Task 0: Hello world from x3003c0s31b2n0 at 17:20:29.
Task 1: Hello world from x3003c0s31b2n0 at 17:20:29.
Task 2: Hello world from x3003c0s31b2n0 at 17:20:29.

We can see from the timestamps that the three tasks launched by srun executed simultaneously.

SRUN & SALLOC: Submitting interactive jobs

As well as being used in job scripts to launch parallel tasks, srun can be used to submit a job interactively on the command line. This can be used to run a specific command using compute node resources, e.g.

Running a single command

$ srun --time=00:02:00 numactl -s
policy: default
preferred node: current
physcpubind: 0 
cpubind: 0 
nodebind: 0 
membind: 0 1

srun can also be used to start an interactive shell session on a compute node using the --pty option, e.g.

Running an interactive session

$ srun --time=00:15:00 --pty /bin/bash --login
$ numactl -s
policy: default
preferred node: current
physcpubind: 0 
cpubind: 0 
nodebind: 0 
membind: 0 1 

It is also possible to use srun to start an interactive shell session associated with the resources allocated to an already-running job, which can be useful for monitoring and debugging jobs. This is done by specifying the job ID of a currently running job using the --jobid option, e.g.

Running an interactive shell in an existing job

You are able to interact with an existing batch job using srun. First let us submit a job and check it's running using squeue:

$ sbatch script.batch 
Submitted batch job 23379
$ squeue --me
  JOBID         USER PARTITION                     NAME ST TIME_LIMIT       TIME  TIME_LEFT NODES NODELIST(REASON)
  23379 user.project     grace             script.batch  R  UNLIMITED       0:02  UNLIMITED     1 x3004c0s9b4n0

Then start an interactive shell in a job step using the job's allocated resources and run some example commands:

$ srun --ntasks=1 --jobid=23379 --pty /bin/bash -l
$ hostname
x3004c0s9b4n0
$ numactl -s
policy: default
preferred node: current
physcpubind: 0 
cpubind: 0 
nodebind: 0 
membind: 0 1 
$ exit
logout

After exiting the interactive shell, the original job continues running:

$ squeue --me
  JOBID         USER PARTITION                     NAME ST TIME_LIMIT       TIME  TIME_LEFT NODES NODELIST(REASON)
  23379 user.project     grace             script.batch  R  UNLIMITED       1:46  UNLIMITED     1 x3004c0s9b4n0

Allocating a compute node as a job

Similarly, salloc can be used to reserve compute node resources, and then srun can be used to run jobs on the requested resource interactively:

$ salloc --time=00:2:00
salloc: Granted job allocation 19130
$ srun hostname
x3004c0s9b4n0
$ srun numactl -s
policy: default
preferred node: current
physcpubind: 0 
cpubind: 0 
nodebind: 0 
membind: 0 1 

Fair Use of Resources

Always set a time limit when using salloc to run jobs interactively, and cancel the job using scancel <JOB_ID> when you have finished (replace <JOB_ID> with the job ID of the allocation).

SQUEUE, SACCT, SCANCEL: Managing jobs

The squeue command shows the jobs queued and running on the system. Combine it with the --me flag to see just your own jobs:

$ squeue --me
  JOBID         USER PARTITION                     NAME ST TIME_LIMIT       TIME  TIME_LEFT NODES NODELIST(REASON)
  19132 user.project     workq              interactive  R       2:00       0:53       1:07     1 nid001038

The sacct command shows your current and previously completed jobs:

$ sacct
JobID           JobName  Partition      Account  AllocCPUS        State ExitCode 
------------ ---------- ---------- ------------ ---------- ------------ -------- 
19107            nvtest      workq      project        144    COMPLETED      0:0 
19107.batch       batch                 project        144    COMPLETED      0:0 
19107.0      nvidia-smi                 project        144       FAILED      2:0 
19114            pysrun      workq      project         72    COMPLETED      0:0 
19114.batch       batch                 project         72    COMPLETED      0:0 
19114.0         python3                 project          1    COMPLETED      0:0 
19114.1         python3                 project          1    COMPLETED      0:0 
19114.2         python3                 project          1    COMPLETED      0:0
19130        interacti+      workq      project         72    CANCELLED      0:0 
19132           pysrun2      workq      project         72      RUNNING      0:0 
19132.batch       batch                 project         72      RUNNING      0:0 
19132.0         python3                 project          1      RUNNING      0:0 

The scancel command cancels a job. For example, to cancel the job with ID 19132:

$ scancel 19132

Advanced job management

Job dependencies

The execution of a Slurm job can be made to depend on the state of other jobs in the queue. Making the start of a job conditional on the outcome of other jobs can be useful in various situations. For example:

  • A job depends on the result of one or more other jobs, e.g. jobs A and B perform pre-processing operations on data to be used by job C.
  • One or more jobs perform operations on the same dataset which cannot be done in parallel, e.g. jobs A and B both modify dataset C and if this happens simultaneously there will be a race condition.
  • A workload needs to run for a period exceeding the maximum time limit available, but can be broken down into smaller chunks to be executed in sequence, e.g. jobs A, B, C etc. run in sequence, with each saving state before the time limit is exceeded and with subsequent jobs resuming from the saved state.

Dependencies are specified using the sbatch --dependency flag. Each dependency has a type which determines the conditions under which the dependency is satisfied (and the job may start).

Using the following simple job script we can demonstrate the effect of using some different dependency types:

submit_dependency.sh
#!/bin/bash
#SBATCH --job-name=dependency_example
#SBATCH --time=1                # Time limit of 1 minute
#SBATCH --ntasks=1

echo "${SLURM_JOB_ID} on $(hostname) at $(date --iso-8601=seconds)"
sleep 30

Without the --dependency flag, submitted jobs can execute simultaneously:

user.project@login02:~> sbatch submit_dependency.sh
Submitted batch job 52638
user.project@login02:~> sbatch submit_dependency.sh
Submitted batch job 52639
user.project@login02:~> sbatch submit_dependency.sh
Submitted batch job 52640
user.project@login02:~> sbatch submit_dependency.sh
Submitted batch job 52641

Using squeue with the --Format flag and specifying the Dependency field causes dependency information about each job to be displayed:

user.project@login02:~> squeue --me --Format="JobID,Name,StateCompact:6,TimeUsed,ReasonList,Dependency:32"
JOBID               NAME                ST    TIME                NODELIST(REASON)    DEPENDENCY
52638               dependency_example  R     0:08                x3010c0s31b1n0      (null)
52639               dependency_example  R     0:05                x3010c0s31b1n0      (null)
52640               dependency_example  R     0:05                x3010c0s31b1n0      (null)
52641               dependency_example  R     0:05                x3010c0s31b1n0      (null)

In this case, the jobs ran simultaneously, as shown by their outputs:

user.project@login02:~> cat slurm-*.out
52638 on x3010c0s31b1n0 at 2025-01-23T16:37:45+00:00
52639 on x3010c0s31b1n0 at 2025-01-23T16:37:48+00:00
52640 on x3010c0s31b1n0 at 2025-01-23T16:37:48+00:00
52641 on x3010c0s31b1n0 at 2025-01-23T16:37:48+00:00

A singleton type dependency ensures that only one instance of a job with a specific name and user can run at any one time. Each job submitted with this dependency will wait until any previously launched job with the same name and user currently executing (or suspended) has terminated.

If a job is submitted multiple times with --dependency=singleton and the same job name, then the jobs will run one at a time, e.g.

user.project@login02:~> sbatch --dependency=singleton submit_dependency.sh
Submitted batch job 52642
user.project@login02:~> sbatch --dependency=singleton submit_dependency.sh
Submitted batch job 52643
user.project@login02:~> sbatch --dependency=singleton submit_dependency.sh
Submitted batch job 52644
user.project@login02:~> sbatch --dependency=singleton submit_dependency.sh
Submitted batch job 52645

The output of squeue shows that 1 job is running and the other 3 jobs have unsatisfied singleton type dependencies (the first submitted job's dependency was automatically satisfied as no other jobs of the same name and user were present):

user.project@login02:~> squeue --me --Format="JobID,Name,StateCompact:6,TimeUsed,ReasonList,Dependency:32"
JOBID               NAME                ST    TIME                NODELIST(REASON)    DEPENDENCY
52643               dependency_example  PD    0:00                (Dependency)        singleton(unfulfilled)
52644               dependency_example  PD    0:00                (Dependency)        singleton(unfulfilled)
52645               dependency_example  PD    0:00                (Dependency)        singleton(unfulfilled)
52642               dependency_example  R     0:04                x3010c0s31b1n0      (null)

After all the jobs have completed, the timestamps in the job outputs show that the jobs executed sequentially:

user.project@login02:~> cat slurm-*.out
52642 on x3010c0s31b1n0 at 2025-01-23T16:40:18+00:00
52643 on x3010c0s31b1n0 at 2025-01-23T16:40:49+00:00
52644 on x3010c0s31b1n0 at 2025-01-23T16:41:19+00:00
52645 on x3010c0s31b1n0 at 2025-01-23T16:41:50+00:00

An afterok type dependency specifies that a job should start execution after 1 or more other specified jobs have successfully completed execution (exit code 0). Unlike singleton, the job IDs of each job being depended on must be specified, e.g.

user.project@login02:~> sbatch submit_dependency.sh
Submitted batch job 52646
user.project@login02:~> sbatch --dependency=afterok:52646 submit_dependency.sh
Submitted batch job 52647
user.project@login02:~> sbatch --dependency=afterok:52647 submit_dependency.sh
Submitted batch job 52648
user.project@login02:~> sbatch --dependency=afterok:52648 submit_dependency.sh
Submitted batch job 52649

The output of squeue shows that 1 job is running and 3 of the jobs have unsatisfied afterok type dependencies on particular job IDs (the first job was submitted without a dependency):

user.project@login02:~> squeue --me --Format="JobID,Name,StateCompact:6,TimeUsed,ReasonList,Dependency:32"
JOBID               NAME                ST    TIME                NODELIST(REASON)    DEPENDENCY
52647               dependency_example  PD    0:00                (Dependency)        afterok:52646(unfulfilled)
52648               dependency_example  PD    0:00                (Dependency)        afterok:52647(unfulfilled)
52649               dependency_example  PD    0:00                (Dependency)        afterok:52648(unfulfilled)
52646               dependency_example  R     0:19                x3010c0s31b1n0      (null)

After all the jobs have completed, the timestamps in the job outputs show that the jobs executed sequentially:

user.project@login02:~> cat slurm-*.out
52646 on x3010c0s31b1n0 at 2025-01-23T16:44:05+00:00
52647 on x3010c0s31b1n0 at 2025-01-23T16:44:35+00:00
52648 on x3010c0s31b1n0 at 2025-01-23T16:45:06+00:00
52649 on x3010c0s31b1n0 at 2025-01-23T16:45:36+00:00

Other job dependency types are available and may be combined to create more complex sets of dependencies. For full details see the sbatch man page.
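
For the case where one job depends on several others (such as job C waiting for pre-processing jobs A and B), a single dependency can list several job IDs separated by colons. The sketch below builds such a string; dependency_spec is a hypothetical helper written for this example, not part of Slurm.

```shell
#!/bin/bash
# dependency_spec is a hypothetical helper, not part of Slurm.
# Usage: dependency_spec <type> <job_id> [<job_id> ...]
# Joins the job IDs with ":" after the dependency type, e.g. afterok:1:2.
dependency_spec() {
    local type="$1"
    shift
    local IFS=":"
    echo "${type}:$*"
}

dependency_spec afterok 52646 52647   # prints: afterok:52646:52647
```

The result could then be passed to sbatch, for example sbatch --dependency="$(dependency_spec afterok 52646 52647)" job_c.sh, where job_c.sh is a placeholder script name.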

Automatically getting the job ID of a submitted job

The sbatch --parsable flag causes sbatch to output the job ID in a machine-parsable form. This can be used to set the value of a shell variable to the job ID of a submitted job, which can then be used to define a job dependency, e.g.

user.project@login02:~> JOBID_1=$(sbatch --parsable submit_dependency.sh)
user.project@login02:~> JOBID_2=$(sbatch --parsable --dependency=afterok:${JOBID_1} submit_dependency.sh)
user.project@login02:~> echo "JOBID_1 = ${JOBID_1}, JOBID_2 = ${JOBID_2}"
JOBID_1 = 52650, JOBID_2 = 52651
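
The same idea extends to a longer chain, such as a workload split into chunks that must run one after another. The sketch below is a minimal illustration under stated assumptions: submit_chain is a hypothetical helper, not part of Slurm, and it only defines the function so that nothing is submitted until you call it.

```shell
#!/bin/bash
# submit_chain is a hypothetical helper, not part of Slurm.
# Usage: submit_chain <count> <script>
# Submits <script> <count> times. Each job after the first is given an
# afterok dependency on the job submitted before it. Prints each job ID.
submit_chain() {
    local count="$1"
    local script="$2"
    local prev=""
    local jobid
    local i=0
    while [ "$i" -lt "$count" ]; do
        if [ -z "$prev" ]; then
            jobid=$(sbatch --parsable "$script")
        else
            jobid=$(sbatch --parsable --dependency="afterok:${prev}" "$script")
        fi
        # --parsable may append ";<cluster name>" to the ID; keep only the ID.
        jobid=${jobid%%;*}
        echo "$jobid"
        prev=$jobid
        i=$((i + 1))
    done
}
```

For example, submit_chain 4 submit_dependency.sh would submit four jobs that run in sequence, provided each one exits with code 0.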