Slurm Job Management

Isambard-AI and Isambard 3 use the Slurm Workload Manager to run jobs on the compute nodes.

Isambard-AI

SBATCH: Writing job submission scripts

You can run a job by submitting a batch script with the sbatch command. For example, if your batch script is called myscript.sh, you can submit it as follows:

user.project@nid001040:~> sbatch myscript.sh
Submitted batch job 19159

This will add your job to the queue for the compute nodes, and your job will run when the requested resource (as defined in the batch script) is available.
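If you drive sbatch from a script, you may want to capture the job ID from the "Submitted batch job" line for later use with squeue or scancel. The following is a minimal Python sketch; the function name parse_job_id is hypothetical, and it assumes the output format shown above.

```python
import re


def parse_job_id(sbatch_output):
    """Return the job ID from a line such as 'Submitted batch job 19159'."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError("no job ID found in: {!r}".format(sbatch_output))
    return int(match.group(1))


print(parse_job_id("Submitted batch job 19159"))  # prints 19159
```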

Specify GPU Resource

You must specify GPU resource in your batch script using either --gpus or one of the --gpus-per-* options. Each GPU requested will also allocate 72 CPU cores and 115 GB of Grace RAM, i.e. one Grace Hopper Superchip.
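To make the arithmetic concrete, the sketch below works out what a GPU request implies, using only the per-GPU figures stated above (72 CPU cores and 115 GB of Grace RAM). The function name allocation_for is hypothetical and this is an illustration, not a query of the scheduler.

```python
# Per-GPU figures taken from the text above (one Grace Hopper Superchip).
CORES_PER_GPU = 72
GRACE_RAM_GB_PER_GPU = 115


def allocation_for(gpus):
    """Return the CPU cores and Grace RAM implied by requesting `gpus` GPUs."""
    if gpus < 1:
        raise ValueError("at least one GPU must be requested")
    return {
        "gpus": gpus,
        "cpu_cores": gpus * CORES_PER_GPU,
        "grace_ram_gb": gpus * GRACE_RAM_GB_PER_GPU,
    }


print(allocation_for(2))
# prints {'gpus': 2, 'cpu_cores': 144, 'grace_ram_gb': 230}
```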

When composing your batch script, please consider the resources available, e.g. you may wish to use srun to run multiple jobs simultaneously across a Grace Hopper Superchip or a node. You should also use the --time directive to set a time limit for your job.
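If you generate batch scripts programmatically, a small helper can turn a duration in seconds into the Hours:Mins:Secs form used with --time in the examples on this page. This is a sketch only; the helper name slurm_time is hypothetical.

```python
def slurm_time(seconds):
    """Format a positive duration in seconds as HH:MM:SS for --time."""
    if seconds <= 0:
        raise ValueError("time limit must be positive")
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return "{:02d}:{:02d}:{:02d}".format(hours, minutes, secs)


# A five-minute limit, as used in the example scripts.
print("#SBATCH --time=" + slurm_time(300))  # prints #SBATCH --time=00:05:00
```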

See the examples below for guidance on how to write your batch script.

Running a single GPU job

The following example batch script requests one GPU and runs the commands hostname and nvidia-smi sequentially. For a single GPU, 72 CPU cores and 115 GB of Grace RAM will be allocated (one Grace Hopper Superchip).

#!/bin/bash

#SBATCH --job-name=docs_ex1
#SBATCH --output=docs_ex1.out
#SBATCH --gpus=1
#SBATCH --time=00:5:00         # Hours:Mins:Secs

hostname
nvidia-smi --list-gpus

If this file is named docs1.batch, you can submit the job as follows:

user.project@nid001040:~> sbatch docs1.batch
Submitted batch job 19159

Checking the output of the job:

user.project@nid001040:~> cat docs_ex1.out 
nid001038
GPU 0: GH200 120GB (UUID: GPU-f9fc6950-574a-5310-dc7c-b8d02cc529db)

The output shows the hostname and the GPU information for the compute node on which the job ran.

Using srun you can also run a single GPU job multiple times in parallel, e.g.

#!/bin/bash

#SBATCH --job-name=docs_ex2
#SBATCH --output=docs_ex2.out
#SBATCH --gpus=2
#SBATCH --ntasks-per-gpu=1
#SBATCH --time=00:5:00

srun nvidia-smi --list-gpus

If we run this batch script and check the output:

user.project@nid001040:~> sbatch docs2.batch 
Submitted batch job 19161
user.project@nid001040:~> cat docs_ex2.out 
GPU 0: GH200 120GB (UUID: GPU-f9fc6950-574a-5310-dc7c-b8d02cc529db)
GPU 0: GH200 120GB (UUID: GPU-4833ca67-f003-3dbd-1d44-a5c7645a5ae3)

Checking the UUIDs, we can see that 2 different GPUs have been allocated.
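The same check can be done programmatically. The sketch below extracts the UUIDs from output in the format shown above and counts the distinct values; the function name gpu_uuids is hypothetical, and the sample text is copied from the example output.

```python
import re


def gpu_uuids(nvidia_smi_output):
    """Return every GPU UUID found in `nvidia-smi --list-gpus` style output."""
    return re.findall(r"UUID: (GPU-[0-9a-f-]+)", nvidia_smi_output)


sample = """GPU 0: GH200 120GB (UUID: GPU-f9fc6950-574a-5310-dc7c-b8d02cc529db)
GPU 0: GH200 120GB (UUID: GPU-4833ca67-f003-3dbd-1d44-a5c7645a5ae3)"""

uuids = gpu_uuids(sample)
print(len(set(uuids)))  # prints 2: each task saw a different GPU
```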

Specifying different job steps with srun

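You can run different job steps concurrently by starting each srun command in the background with & and adding the wait command at the end, so that the batch script does not exit (ending the job) before the job steps have finished, e.g.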

srun --ntasks=1 --gpus=1 --exclusive job_step1 &
srun --ntasks=1 --gpus=1 --exclusive job_step2 &
wait

In a job where 2 or more GPUs have been allocated, this will run the job steps concurrently, allocating 1 GPU to each job step and running 1 task per job step. The srun --exclusive flag here ensures that the job steps are only allocated as much resource as requested in the srun command and that they can run concurrently.
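The background-and-wait pattern is not specific to the shell. As an illustration, the Python sketch below starts several commands without blocking and then waits for all of them. The function name run_steps_concurrently is hypothetical, and placeholder commands stand in for the srun job steps so that the sketch runs anywhere; inside a real allocation you would pass argument lists such as ["srun", "--ntasks=1", "--gpus=1", "--exclusive", "job_step1"].

```python
import subprocess
import sys


def run_steps_concurrently(commands):
    """Start every command in the background, then wait for all of them."""
    # Popen returns immediately, like ending a shell command with `&`.
    processes = [subprocess.Popen(command) for command in commands]
    # wait() blocks until each process exits, like the shell `wait` builtin.
    return [process.wait() for process in processes]


# Placeholder steps: each one simply starts a Python interpreter and exits.
placeholder = [sys.executable, "-c", "pass"]
print(run_steps_concurrently([placeholder, placeholder]))  # prints [0, 0]
```

The returned list holds one exit code per step, in the order the steps were given.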

Running a Python script on single CPUs simultaneously

Consider the following Python script:

#!/usr/bin/env python3

import os
from time import sleep
from datetime import datetime
import socket

sleep(30)

time_now = datetime.now().strftime("%H:%M:%S")

print('Task {}: Hello world from {} at {}.'.format(os.environ["SLURM_PROCID"], socket.gethostname(), time_now))

This script sleeps for 30 seconds, then prints its Slurm task ID (SLURM_PROCID), the hostname and the time. If this script is called pysrun.py, then we can write the following batch script, in which a single srun command launches pysrun.py as three tasks:

#!/bin/bash

#SBATCH --job-name=docs_ex3
#SBATCH --output=docs_ex3.out
#SBATCH --gpus=1                # this allocates 72 CPU cores
#SBATCH --ntasks-per-gpu=3
#SBATCH --time=00:5:00

module load cray-python

srun python3 pysrun.py

If we run this batch script and check the output:

user.project@nid001040:~> sbatch docs3.batch 
Submitted batch job 19162
user.project@nid001040:~> cat docs_ex3.out 
Task 0: Hello world from nid001016 at 17:20:29.
Task 1: Hello world from nid001016 at 17:20:29.
Task 2: Hello world from nid001016 at 17:20:29.

We can see from the timestamps that the three tasks launched by srun executed simultaneously.
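For larger task counts, comparing timestamps by eye becomes impractical. The sketch below parses lines in the format printed by pysrun.py and collects the distinct timestamps; the function name parse_tasks is hypothetical, and the sample is the output shown above.

```python
import re

# Matches the line format printed by pysrun.py.
LINE = re.compile(r"Task (\d+): Hello world from (\S+) at (\d{2}:\d{2}:\d{2})\.")


def parse_tasks(output):
    """Return a list of (task_id, hostname, time) tuples."""
    return [(int(task), host, time) for task, host, time in LINE.findall(output)]


sample = """Task 0: Hello world from nid001016 at 17:20:29.
Task 1: Hello world from nid001016 at 17:20:29.
Task 2: Hello world from nid001016 at 17:20:29."""

tasks = parse_tasks(sample)
timestamps = {time for _, _, time in tasks}
print(len(tasks), timestamps)  # prints 3 {'17:20:29'}
```

A single distinct timestamp across all tasks indicates they finished within the same second.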

SRUN & SALLOC: Submitting interactive jobs

As well as being used in job scripts to launch parallel tasks, srun can be used to submit a job interactively on the command line. This can be used to run a specific command using compute node resources, e.g.

user.project@nid001040:~> srun --gpus=1 --time=00:02:00 nvidia-smi --list-gpus
srun: job 19164 queued and waiting for resources
srun: job 19164 has been allocated resources

GPU 0: GH200 120GB (UUID: GPU-4833ca67-f003-3dbd-1d44-a5c7645a5ae3)

srun can also be used to start an interactive shell session on a compute node using the --pty option, e.g.

user.project@nid001040:~> srun --gpus=1 --time=00:15:00 --pty /bin/bash --login
srun: job 22874 queued and waiting for resources
srun: job 22874 has been allocated resources
user.project@nid001005:~> hostname
nid001005
user.project@nid001005:~> nvidia-smi --list-gpus
GPU 0: GH200 120GB (UUID: GPU-f00d9a03-840c-5ea2-a748-243383b6efbc)

Used in this way, srun creates a resource allocation before running the specified command or shell in the allocation.

It is also possible to use srun to start an interactive shell session associated with the resources allocated to an already-running job, which can be useful for monitoring and debugging jobs. This is done by specifying the job ID of a currently running job using the --jobid option, e.g.

## Submit batch job
user.project@nid001040:~> sbatch script.batch 
Submitted batch job 22886

## Check job has started running
user.project@nid001040:~> squeue --me
  JOBID         USER PARTITION                     NAME ST TIME_LIMIT       TIME  TIME_LEFT NODES NODELIST(REASON)
  22886 user.project     workq                   script  R      10:00       0:03       9:57     1 nid001005

## Start an interactive shell in a job step using the job's allocated resources
user.project@nid001040:~> srun --ntasks=1 --gpus=1 --jobid=22886 --pty /bin/bash -l

## Run interactive commands in the shell, then exit
user.project@nid001005:~> hostname
nid001005
user.project@nid001005:~> nvidia-smi --list-gpus
GPU 0: GH200 120GB (UUID: GPU-f00d9a03-840c-5ea2-a748-243383b6efbc)
user.project@nid001005:~> exit
logout

## After exiting the interactive shell, the original job continues running
user.project@nid001040:~> squeue --me
  JOBID         USER PARTITION                     NAME ST TIME_LIMIT       TIME  TIME_LEFT NODES NODELIST(REASON)
  22886 user.project     workq                   script  R      10:00       0:39       9:21     1 nid001005
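If you attach to running jobs often, you could assemble the srun command from the job ID. The sketch below only builds and prints the command line, using the options from the transcript above; the function name attach_shell_command is hypothetical, and nothing is executed.

```python
import shlex


def attach_shell_command(job_id, gpus=1):
    """Build the srun argument list for an interactive shell inside a running job."""
    return [
        "srun",
        "--ntasks=1",
        "--gpus={}".format(gpus),
        "--jobid={}".format(job_id),
        "--pty",
        "/bin/bash",
        "-l",
    ]


print(shlex.join(attach_shell_command(22886)))
# prints srun --ntasks=1 --gpus=1 --jobid=22886 --pty /bin/bash -l
```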

Similarly, salloc can be used to reserve compute node resources, and then srun can be used to run jobs on the requested resource interactively:

user.project@nid001040:~> salloc --gpus=1 --time=00:2:00
salloc: Granted job allocation 19130
user.project@nid001040:~> srun hostname
nid001016
user.project@nid001040:~> srun nvidia-smi --list-gpus
GPU 0: GH200 120GB (UUID: GPU-4833ca67-f003-3dbd-1d44-a5c7645a5ae3)

Fair Use of Resources

Always set a time limit when using salloc to run jobs interactively, and cancel the job using scancel <JOB_ID> when you have finished (replace <JOB_ID> with the job ID of the allocation).

SQUEUE, SACCT, SCANCEL: Managing jobs

The squeue command shows the jobs queued and running on the system. Combine it with the --me flag to see just your own jobs:

user.project@nid001040:~> squeue --me
  JOBID         USER PARTITION                     NAME ST TIME_LIMIT       TIME  TIME_LEFT NODES NODELIST(REASON)
  19132 user.project     workq              interactive  R       2:00       0:53       1:07     1 nid001038
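Output in this layout can be turned into records for further processing. The sketch below splits each row on whitespace and pairs the values with the header names; the function name parse_squeue is hypothetical, and the approach assumes that no field (for example, the job name) contains spaces.

```python
def parse_squeue(output):
    """Return one dict per job, keyed by the column names in the header row."""
    lines = [line for line in output.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


sample = """  JOBID         USER PARTITION                     NAME ST TIME_LIMIT       TIME  TIME_LEFT NODES NODELIST(REASON)
  19132 user.project     workq              interactive  R       2:00       0:53       1:07     1 nid001038"""

jobs = parse_squeue(sample)
print(jobs[0]["JOBID"], jobs[0]["ST"], jobs[0]["TIME_LEFT"])  # prints 19132 R 1:07
```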

The sacct command shows your current and previously completed jobs:

user.project@nid001040:~> sacct
JobID           JobName  Partition      Account  AllocCPUS        State ExitCode 
------------ ---------- ---------- ------------ ---------- ------------ -------- 
19107            nvtest      workq      project        144    COMPLETED      0:0 
19107.batch       batch                 project        144    COMPLETED      0:0 
19107.0      nvidia-smi                 project        144       FAILED      2:0 
19114            pysrun      workq      project         72    COMPLETED      0:0 
19114.batch       batch                 project         72    COMPLETED      0:0 
19114.0         python3                 project          1    COMPLETED      0:0 
19114.1         python3                 project          1    COMPLETED      0:0 
19114.2         python3                 project          1    COMPLETED      0:0
19130        interacti+      workq      project         72    CANCELLED      0:0 
19132           pysrun2      workq      project         72      RUNNING      0:0 
19132.batch       batch                 project         72      RUNNING      0:0 
19132.0         python3                 project          1      RUNNING      0:0 
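To pick out failed jobs or job steps from output like this, a short script can scan the State column. The sketch below is illustrative: the function name failed_entries is hypothetical, and it assumes the default column layout shown above, in which State and ExitCode are the last two fields of each row. The sample rows are copied from the example output.

```python
def failed_entries(sacct_output):
    """Return (job_id, exit_code) for every row whose State is FAILED."""
    failed = []
    # Skip the header row and the row of dashes beneath it.
    for line in sacct_output.splitlines()[2:]:
        fields = line.split()
        # Step rows have an empty Partition column, so index from the end.
        if len(fields) >= 3 and fields[-2] == "FAILED":
            failed.append((fields[0], fields[-1]))
    return failed


sample = """JobID           JobName  Partition      Account  AllocCPUS        State ExitCode
------------ ---------- ---------- ------------ ---------- ------------ --------
19107            nvtest      workq      project        144    COMPLETED      0:0
19107.batch       batch                 project        144    COMPLETED      0:0
19107.0      nvidia-smi                 project        144       FAILED      2:0
19114            pysrun      workq      project         72    COMPLETED      0:0"""

print(failed_entries(sample))  # prints [('19107.0', '2:0')]
```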

scancel is used to cancel a job. E.g. to cancel a job with the ID 19132:

user.project@nid001040:~> scancel 19132

Isambard 3

Isambard 3 consists of 2 clusters: Grace CPU Superchip and Multi-Architecture Comparison System (MACS).

SBATCH: Writing job submission scripts

You can run a job by submitting a batch script with the sbatch command. For example, if your batch script is called myscript.sh, you can submit it as follows:

$ sbatch myscript.sh
Submitted batch job 19159

This will add your job to the queue for the compute nodes, and your job will run when the requested resource (as defined in the batch script) is available.

When composing your batch script, please consider the resources available. You should also use the --time directive to set a time limit for your job.

See the examples below for guidance on how to write your batch script.

Running a single job

The following example batch script runs the commands hostname and numactl -s sequentially. For a single node, 144 CPU cores and 200 GB of Grace RAM will be allocated (one Grace CPU Superchip).

#!/bin/bash

#SBATCH --job-name=docs_ex1
#SBATCH --output=docs_ex1.out
#SBATCH --time=00:5:00         # Hours:Mins:Secs
hostname
numactl -s

If this file is named docs1.batch, you can submit the job as follows:

$ sbatch docs1.batch
Submitted batch job 19159

Checking the output of the job:

$ cat docs_ex1.out 
x3003c0s31b2n0
policy: default
preferred node: current
physcpubind: 0 
cpubind: 0 
nodebind: 0 
membind: 0 1 

The output shows the hostname and the CPUs made available on the compute node on which the job ran. Note that only 1 CPU (or core) is available.

Using srun you can also run a single job multiple times in parallel, e.g.

#!/bin/bash

#SBATCH --job-name=docs_ex2
#SBATCH --output=docs_ex2.out
#SBATCH --time=00:5:00
#SBATCH --ntasks=2
srun numactl -s

If we run this batch script and check the output:

$ sbatch docs2.batch 
Submitted batch job 19161
$ cat docs_ex2.out 
policy: default
preferred node: current
physcpubind: 0 
cpubind: 0 
nodebind: 0 
membind: 0 1 
policy: default
preferred node: current
physcpubind: 72 
cpubind: 1 
nodebind: 1 
membind: 0 1 

Checking the physcpubind lines, we can see that 2 different CPUs (0 and 72) have been allocated.
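The CPU bindings can also be extracted with a few lines of Python. The sketch below collects the physcpubind values from output in the format shown above, one list per task; the function name bound_cpus is hypothetical, and the sample is abridged from the example output.

```python
def bound_cpus(numactl_output):
    """Return one list of CPU IDs per 'physcpubind:' line in the output."""
    cpus = []
    for line in numactl_output.splitlines():
        if line.startswith("physcpubind:"):
            values = line.split(":", 1)[1].split()
            cpus.append([int(value) for value in values])
    return cpus


sample = """policy: default
preferred node: current
physcpubind: 0
cpubind: 0
policy: default
preferred node: current
physcpubind: 72
cpubind: 1"""

print(bound_cpus(sample))  # prints [[0], [72]]
```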

Specifying different job steps with srun

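You can run different job steps concurrently by starting each srun command in the background with & and adding the wait command at the end, so that the batch script does not exit (ending the job) before the job steps have finished, e.g.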

srun --ntasks=1 --exclusive job_step1 &
srun --ntasks=1 --exclusive job_step2 &
wait

In a job where 2 or more tasks have been allocated, this will run the job steps concurrently, running 1 task per job step. The srun --exclusive flag here ensures that the job steps are only allocated as much resource as requested in the srun command and that they can run concurrently.

Running a Python script on single CPUs simultaneously

Consider the following Python script:

#!/usr/bin/env python3

import os
from time import sleep
from datetime import datetime
import socket

sleep(30)

time_now = datetime.now().strftime("%H:%M:%S")

print('Task {}: Hello world from {} at {}.'.format(os.environ["SLURM_PROCID"], socket.gethostname(), time_now))

This script sleeps for 30 seconds, then prints its Slurm task ID (SLURM_PROCID), the hostname and the time. If this script is called pysrun.py, then we can write the following batch script, in which a single srun command launches pysrun.py as three tasks:

#!/bin/bash

#SBATCH --job-name=docs_ex3
#SBATCH --output=docs_ex3.out
#SBATCH --ntasks=3
#SBATCH --time=00:5:00

module load cray-python

srun python3 pysrun.py

If we run this batch script and check the output:

$ sbatch docs3.batch 
Submitted batch job 19162
$ cat docs_ex3.out 
Task 0: Hello world from x3003c0s31b2n0 at 17:20:29.
Task 1: Hello world from x3003c0s31b2n0 at 17:20:29.
Task 2: Hello world from x3003c0s31b2n0 at 17:20:29.

We can see from the timestamps that the three tasks launched by srun executed simultaneously.

SRUN & SALLOC: Submitting interactive jobs

As well as being used in job scripts to launch parallel tasks, srun can be used to submit a job interactively on the command line. This can be used to run a specific command using compute node resources, e.g.

Running a single command

$ srun --time=00:02:00 numactl -s
policy: default
preferred node: current
physcpubind: 0 
cpubind: 0 
nodebind: 0 
membind: 0 1

srun can also be used to start an interactive shell session on a compute node using the --pty option, e.g.

Running an interactive session

$ srun --time=00:15:00 --pty /bin/bash --login
$ numactl -s
policy: default
preferred node: current
physcpubind: 0 
cpubind: 0 
nodebind: 0 
membind: 0 1 

It is also possible to use srun to start an interactive shell session associated with the resources allocated to an already-running job, which can be useful for monitoring and debugging jobs. This is done by specifying the job ID of a currently running job using the --jobid option, e.g.

Running an interactive shell in an existing job

You are able to interact with an existing batch job using srun. First let us submit a job and check it's running using squeue:

$ sbatch script.batch 
Submitted batch job 23379
$ squeue --me
  JOBID         USER PARTITION                     NAME ST TIME_LIMIT       TIME  TIME_LEFT NODES NODELIST(REASON)
  23379 user.project     grace             script.batch  R  UNLIMITED       0:02  UNLIMITED     1 x3004c0s9b4n0

Then start an interactive shell in a job step using the job's allocated resources, and run some example commands:

$ srun --ntasks=1 --jobid=23379 --pty /bin/bash -l
$ hostname
x3004c0s9b4n0
$ numactl -s
policy: default
preferred node: current
physcpubind: 0 
cpubind: 0 
nodebind: 0 
membind: 0 1 
$ exit
logout

After exiting the interactive shell, the original job continues running:

$ squeue --me
  JOBID         USER PARTITION                     NAME ST TIME_LIMIT       TIME  TIME_LEFT NODES NODELIST(REASON)
  23379 user.project     grace             script.batch  R  UNLIMITED       1:46  UNLIMITED     1 x3004c0s9b4n0

Allocating a compute node as a job

Similarly, salloc can be used to reserve compute node resources, and then srun can be used to run jobs on the requested resource interactively:

$ salloc --time=00:2:00
salloc: Granted job allocation 19130
$ srun hostname
x3004c0s9b4n0
$ srun numactl -s
policy: default
preferred node: current
physcpubind: 0 
cpubind: 0 
nodebind: 0 
membind: 0 1 

Fair Use of Resources

Always set a time limit when using salloc to run jobs interactively, and cancel the job using scancel <JOB_ID> when you have finished (replace <JOB_ID> with the job ID of the allocation).

SQUEUE, SACCT, SCANCEL: Managing jobs

The squeue command shows the jobs queued and running on the system. Combine it with the --me flag to see just your own jobs:

$ squeue --me
  JOBID         USER PARTITION                     NAME ST TIME_LIMIT       TIME  TIME_LEFT NODES NODELIST(REASON)
  19132 user.project     workq              interactive  R       2:00       0:53       1:07     1 nid001038

The sacct command shows your current and previously completed jobs:

$ sacct
JobID           JobName  Partition      Account  AllocCPUS        State ExitCode 
------------ ---------- ---------- ------------ ---------- ------------ -------- 
19107            nvtest      workq      project        144    COMPLETED      0:0 
19107.batch       batch                 project        144    COMPLETED      0:0 
19107.0      nvidia-smi                 project        144       FAILED      2:0 
19114            pysrun      workq      project         72    COMPLETED      0:0 
19114.batch       batch                 project         72    COMPLETED      0:0 
19114.0         python3                 project          1    COMPLETED      0:0 
19114.1         python3                 project          1    COMPLETED      0:0 
19114.2         python3                 project          1    COMPLETED      0:0
19130        interacti+      workq      project         72    CANCELLED      0:0 
19132           pysrun2      workq      project         72      RUNNING      0:0 
19132.batch       batch                 project         72      RUNNING      0:0 
19132.0         python3                 project          1      RUNNING      0:0 

scancel is used to cancel a job. E.g. to cancel a job with the ID 19132:

$ scancel 19132