Slurm Job Management¶
Isambard-AI and Isambard 3 use the Slurm Workload Manager to run jobs on the compute nodes.
Isambard-AI¶
SBATCH: Writing job submission scripts¶
You can run a job by submitting a batch script with the sbatch command. For example, if your batch script is called myscript.sh, you can submit it as follows:
user.project@nid001040:~> sbatch myscript.sh
Submitted batch job 19159
This will add your job to the queue for the compute nodes, and your job will run when the requested resources (as defined in the batch script) become available.
Specify GPU Resource
You must specify GPU resources in your batch script using either --gpus or one of the --gpus-per-* options. Each GPU requested will also allocate 72 CPU cores and 115 GB of Grace RAM, i.e. one Grace Hopper Superchip.
When composing your batch script, please consider the resources available; for example, you may wish to use srun to run multiple jobs simultaneously across a Grace Hopper Superchip or a node. You should also use the --time directive to set a time limit for your job.
See the examples below for guidance on how to write your batch script.
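As a quick sketch of the directives described above, a batch header using one of the --gpus-per-* alternatives might look like the following; the job name, output file, GPU count and application name are illustrative only and are not taken from the examples below.
#!/bin/bash
#SBATCH --job-name=example            # illustrative job name
#SBATCH --output=example.out
#SBATCH --gpus-per-node=2             # one of the --gpus-per-* alternatives to --gpus
#SBATCH --nodes=1                     # with --gpus-per-node it is usual to set the node count too
#SBATCH --time=00:10:00               # Hours:Mins:Secs
srun my_application                   # hypothetical application run inside the allocation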
Running a single GPU job¶
The following example batch script shows the commands hostname and nvidia-smi running sequentially, requesting one GPU. For a single GPU, 72 CPU cores and 115 GB of Grace RAM will be allocated (one Grace Hopper Superchip):
#!/bin/bash
#SBATCH --job-name=docs_ex1
#SBATCH --output=docs_ex1.out
#SBATCH --gpus=1
#SBATCH --time=00:5:00 # Hours:Mins:Secs
hostname
nvidia-smi --list-gpus
If this file is named docs1.batch, you can submit the job as follows:
user.project@nid001040:~> sbatch docs1.batch
Submitted batch job 19159
Checking the output of the job:
user.project@nid001040:~> cat docs_ex1.out
nid001038
GPU 0: GH200 120GB (UUID: GPU-f9fc6950-574a-5310-dc7c-b8d02cc529db)
The output shows the hostname and the GPU information for the compute node on which the job ran.
Using srun, you can also run a single GPU job multiple times in parallel, e.g.
#!/bin/bash
#SBATCH --job-name=docs_ex2
#SBATCH --output=docs_ex2.out
#SBATCH --gpus=2
#SBATCH --ntasks-per-gpu=1
#SBATCH --time=00:5:00
srun nvidia-smi --list-gpus
If we run this batch script and check the output:
user.project@nid001040:~> sbatch docs2.batch
Submitted batch job 19161
user.project@nid001040:~> cat docs_ex2.out
GPU 0: GH200 120GB (UUID: GPU-f9fc6950-574a-5310-dc7c-b8d02cc529db)
GPU 0: GH200 120GB (UUID: GPU-4833ca67-f003-3dbd-1d44-a5c7645a5ae3)
Checking the UUIDs, we can see that two different GPUs have been allocated.
Specifying different job steps with srun
You can chain together different job steps using &, adding the wait command at the end (so the job does not terminate before the background steps finish), e.g.
srun job_step1 &
srun job_step2 &
wait
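A fuller, illustrative sketch of this pattern on Isambard-AI might give each step one GPU of a two-GPU allocation; job_step1 and job_step2 stand in for your own commands, and on some configurations you may also need srun's --exact option so that concurrent steps do not block one another.
#!/bin/bash
#SBATCH --job-name=two_steps          # illustrative job name
#SBATCH --output=two_steps.out
#SBATCH --gpus=2
#SBATCH --time=00:10:00
srun --ntasks=1 --gpus=1 job_step1 &  # first step on one GPU, run in the background
srun --ntasks=1 --gpus=1 job_step2 &  # second step on the other GPU
wait                                  # keep the batch job alive until both steps finish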
Running a Python script on single CPUs simultaneously¶
Consider the following python script:
#!/usr/bin/env python3
import os
from time import sleep
from datetime import datetime
import socket
sleep(30)
time_now = datetime.now().strftime("%H:%M:%S")
print('Task {}: Hello world from {} at {}.'.format(os.environ["SLURM_PROCID"], socket.gethostname(), time_now))
This script sleeps for 30 seconds, then prints its task ID, the hostname and the time. If this script is called pysrun.py, we can write the following batch script, which runs pysrun.py three times with srun:
#!/bin/bash
#SBATCH --job-name=docs_ex3
#SBATCH --output=docs_ex3.out
#SBATCH --gpus=1 # this allocates 72 CPU cores
#SBATCH --ntasks-per-gpu=3
#SBATCH --time=00:5:00
module load cray-python
srun python3 pysrun.py
If we run this batch script and check the output:
user.project@nid001040:~> sbatch docs3.batch
Submitted batch job 19162
user.project@nid001040:~> cat docs_ex3.out
Task 0: Hello world from nid001016 at 17:20:29.
Task 1: Hello world from nid001016 at 17:20:29.
Task 2: Hello world from nid001016 at 17:20:29.
We can see from the timestamps that the three tasks have executed simultaneously.
SRUN & SALLOC: Submitting interactive jobs¶
As well as its use above in job scripts for parallel job submission, srun can also be used to submit a job interactively on the command line. This can be used to run a specific command using compute node resources, e.g.
user.project@nid001040:~> srun --gpus=1 --time=00:02:00 nvidia-smi --list-gpus
srun: job 19164 queued and waiting for resources
srun: job 19164 has been allocated resources
GPU 0: GH200 120GB (UUID: GPU-4833ca67-f003-3dbd-1d44-a5c7645a5ae3)
srun can also be used to start an interactive shell session on a compute node using the --pty option, e.g.
user.project@nid001040:~> srun --gpus=1 --time=00:15:00 --pty /bin/bash --login
srun: job 22874 queued and waiting for resources
srun: job 22874 has been allocated resources
user.project@nid001005:~> hostname
nid001005
user.project@nid001005:~> nvidia-smi --list-gpus
GPU 0: GH200 120GB (UUID: GPU-f00d9a03-840c-5ea2-a748-243383b6efbc)
Used in this way, srun creates a resource allocation before running the specified command or shell in that allocation.
It is also possible to use srun to start an interactive shell session associated with the resources allocated to an already-running job, which can be useful for monitoring and debugging jobs. This is done by specifying the job ID of a currently running job using the --jobid option, e.g.
# Submit batch job
user.project@nid001040:~> sbatch script.batch
Submitted batch job 22886
# Check job has started running
user.project@nid001040:~> squeue --me
JOBID USER PARTITION NAME ST TIME_LIMIT TIME TIME_LEFT NODES NODELIST(REASON)
22886 user.project workq script R 10:00 0:03 9:57 1 nid001005
# Start an interactive shell in a job step using the job's allocated resources
user.project@nid001040:~> srun --ntasks=1 --gpus=1 --jobid=22886 --pty /bin/bash -l
# Run interactive commands in the shell, then exit
user.project@nid001005:~> hostname
nid001005
user.project@nid001005:~> nvidia-smi --list-gpus
GPU 0: GH200 120GB (UUID: GPU-f00d9a03-840c-5ea2-a748-243383b6efbc)
user.project@nid001005:~> exit
logout
# After exiting the interactive shell, the original job continues running
user.project@nid001040:~> squeue --me
JOBID USER PARTITION NAME ST TIME_LIMIT TIME TIME_LEFT NODES NODELIST(REASON)
22886 user.project workq script R 10:00 0:39 9:21 1 nid001005
Similarly, salloc can be used to reserve compute node resources, and then srun can be used to run jobs on the requested resources interactively:
user.project@nid001040:~> salloc --gpus=1 --time=00:2:00
salloc: Granted job allocation 19130
user.project@nid001040:~> srun hostname
nid001016
user.project@nid001040:~> srun nvidia-smi --list-gpus
GPU 0: GH200 120GB (UUID: GPU-4833ca67-f003-3dbd-1d44-a5c7645a5ae3)
Fair Use of Resources
Always set a time limit when using salloc to run jobs interactively, and cancel the job using scancel <JOB_ID> when you have finished (replace <JOB_ID> with the job ID of the allocation).
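A typical interactive workflow is sketched below; the time limit is illustrative and <JOB_ID> is the job ID printed by salloc when the allocation is granted.
user.project@nid001040:~> salloc --gpus=1 --time=00:30:00   # always request a finite time limit
# ... run srun commands interactively within the allocation ...
user.project@nid001040:~> scancel <JOB_ID>                  # release the allocation as soon as you are done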
SQUEUE, SACCT, SCANCEL: Managing jobs¶
The squeue command shows the jobs running on the system; combine it with the --me flag to see only your own jobs that are currently queued or running:
user.project@nid001040:~> squeue --me
JOBID USER PARTITION NAME ST TIME_LIMIT TIME TIME_LEFT NODES NODELIST(REASON)
19132 user.project workq interactive R 2:00 0:53 1:07 1 nid001038
The sacct command shows your current and previously completed jobs:
user.project@nid001040:~> sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ------------ ---------- ------------ --------
19107 nvtest workq project 144 COMPLETED 0:0
19107.batch batch project 144 COMPLETED 0:0
19107.0 nvidia-smi project 144 FAILED 2:0
19114 pysrun workq project 72 COMPLETED 0:0
19114.batch batch project 72 COMPLETED 0:0
19114.0 python3 project 1 COMPLETED 0:0
19114.1 python3 project 1 COMPLETED 0:0
19114.2 python3 project 1 COMPLETED 0:0
19130 interacti+ workq project 72 CANCELLED 0:0
19132 pysrun2 workq project 72 RUNNING 0:0
19132.batch batch project 72 RUNNING 0:0
19132.0 python3 project 1 RUNNING 0:0
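If you want different columns, sacct accepts the standard Slurm --format option; the example below is illustrative rather than Isambard-specific:
user.project@nid001040:~> sacct --format=JobID,JobName,Elapsed,State,ExitCode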
scancel is used to cancel a job, e.g. to cancel the job with ID 19132:
user.project@nid001040:~> scancel 19132
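To cancel all of your own queued and running jobs at once (use with care), you can pass your username with scancel's standard -u option:
user.project@nid001040:~> scancel -u $USER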
Isambard 3¶
Isambard 3 consists of two clusters: the Grace CPU Superchip cluster and the Multi-Architecture Comparison System (MACS).
SBATCH: Writing job submission scripts¶
You can run a job by submitting a batch script with the sbatch command. For example, if your batch script is called myscript.sh, you can submit it as follows:
$ sbatch myscript.sh
Submitted batch job 19159
This will add your job to the queue for the compute nodes, and your job will run when the requested resources (as defined in the batch script) become available.
When composing your batch script, please consider the resources available. You should also use the --time directive to set a time limit for your job.
See the examples below for guidance on how to write your batch script.
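As a quick sketch for the Grace CPU Superchip cluster, a batch header for a multi-task CPU job might look like the following; the job name, task count and application name are illustrative only.
#!/bin/bash
#SBATCH --job-name=example            # illustrative job name
#SBATCH --output=example.out
#SBATCH --nodes=1
#SBATCH --ntasks=144                  # e.g. one task per core on a Grace CPU Superchip node
#SBATCH --time=00:10:00               # Hours:Mins:Secs
srun my_application                   # hypothetical application, launched once per task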
Running a single job¶
The following example batch script shows the commands hostname and numactl -s running sequentially. For a single node, 144 CPU cores and 200 GB of Grace RAM will be allocated (one Grace CPU Superchip):
#!/bin/bash
#SBATCH --job-name=docs_ex1
#SBATCH --output=docs_ex1.out
#SBATCH --time=00:5:00 # Hours:Mins:Secs
hostname
numactl -s
If this file is named docs1.batch, you can submit the job as follows:
$ sbatch docs1.batch
Submitted batch job 19159
Checking the output of the job:
$ cat docs_ex1.out
x3003c0s31b2n0
policy: default
preferred node: current
physcpubind: 0
cpubind: 0
nodebind: 0
membind: 0 1
The output shows the hostname and the CPUs made available on the compute node on which the job ran. Note that only 1 CPU (or core) is available.
Using srun, you can also run a single job multiple times in parallel, e.g.
#!/bin/bash
#SBATCH --job-name=docs_ex2
#SBATCH --output=docs_ex2.out
#SBATCH --time=00:5:00
#SBATCH --ntasks=2
srun numactl -s
If we run this batch script and check the output:
$ sbatch docs2.batch
Submitted batch job 19161
$ cat docs_ex2.out
policy: default
preferred node: current
physcpubind: 0
cpubind: 0
nodebind: 0
membind: 0 1
policy: default
preferred node: current
physcpubind: 72
cpubind: 1
nodebind: 1
membind: 0 1
Checking the physcpubind values, we can see that two different CPUs (0 and 72) have been allocated.
Specifying different job steps with srun
You can chain together different job steps using &, adding the wait command at the end (so the job does not terminate before the background steps finish), e.g.
srun job_step1 &
srun job_step2 &
wait
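An illustrative expansion of this pattern on the CPU nodes gives each step one task of a two-task allocation; job_step1 and job_step2 again stand in for your own commands.
#!/bin/bash
#SBATCH --job-name=two_steps          # illustrative job name
#SBATCH --output=two_steps.out
#SBATCH --ntasks=2
#SBATCH --time=00:10:00
srun --ntasks=1 job_step1 &           # first step, one task, run in the background
srun --ntasks=1 job_step2 &           # second step, one task, also in the background
wait                                  # keep the batch job alive until both steps finish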
Running a Python script on single CPUs simultaneously¶
Consider the following python script:
#!/usr/bin/env python3
import os
from time import sleep
from datetime import datetime
import socket
sleep(30)
time_now = datetime.now().strftime("%H:%M:%S")
print('Task {}: Hello world from {} at {}.'.format(os.environ["SLURM_PROCID"], socket.gethostname(), time_now))
This script sleeps for 30 seconds, then prints its task ID, the hostname and the time. If this script is called pysrun.py, we can write the following batch script, which runs pysrun.py three times with srun:
#!/bin/bash
#SBATCH --job-name=docs_ex3
#SBATCH --output=docs_ex3.out
#SBATCH --ntasks=3
#SBATCH --time=00:5:00
module load cray-python
srun python3 pysrun.py
If we run this batch script and check the output:
$ sbatch docs3.batch
Submitted batch job 19162
$ cat docs_ex3.out
Task 0: Hello world from x3003c0s31b2n0 at 17:20:29.
Task 1: Hello world from x3003c0s31b2n0 at 17:20:29.
Task 2: Hello world from x3003c0s31b2n0 at 17:20:29.
We can see from the timestamps that the three tasks have executed simultaneously.
SRUN & SALLOC: Submitting interactive jobs¶
As well as its use above in job scripts for parallel job submission, srun can also be used to submit a job interactively on the command line. This can be used to run a specific command using compute node resources, e.g.
Running a single command¶
$ srun --time=00:02:00 numactl -s
policy: default
preferred node: current
physcpubind: 0
cpubind: 0
nodebind: 0
membind: 0 1
srun can also be used to start an interactive shell session on a compute node using the --pty option, e.g.
Running an interactive session¶
$ srun --time=00:15:00 --pty /bin/bash --login
$ numactl -s
policy: default
preferred node: current
physcpubind: 0
cpubind: 0
nodebind: 0
membind: 0 1
It is also possible to use srun to start an interactive shell session associated with the resources allocated to an already-running job, which can be useful for monitoring and debugging jobs. This is done by specifying the job ID of a currently running job using the --jobid option, e.g.
Running an interactive shell in an existing job¶
You are able to interact with an existing batch job using srun. First, let us submit a job and check that it is running using squeue:
$ sbatch script.batch
Submitted batch job 23379
$ squeue --me
JOBID USER PARTITION NAME ST TIME_LIMIT TIME TIME_LEFT NODES NODELIST(REASON)
23379 user.project grace script.batch R UNLIMITED 0:02 UNLIMITED 1 x3004c0s9b4n0
We then start an interactive shell in a job step using the job's allocated resources and run some example commands:
$ srun --ntasks=1 --jobid=23379 --pty /bin/bash -l
$ hostname
x3004c0s9b4n0
$ numactl -s
policy: default
preferred node: current
physcpubind: 0
cpubind: 0
nodebind: 0
membind: 0 1
$ exit
logout
After exiting the interactive shell, the original job continues running
$ squeue --me
JOBID USER PARTITION NAME ST TIME_LIMIT TIME TIME_LEFT NODES NODELIST(REASON)
23379 user.project grace script.batch R UNLIMITED 1:46 UNLIMITED 1 x3004c0s9b4n0
Allocating a compute node as a job¶
Similarly, salloc can be used to reserve compute node resources, and then srun can be used to run jobs on the requested resources interactively:
$ salloc --time=00:2:00
salloc: Granted job allocation 19130
$ srun hostname
x3004c0s9b4n0
$ srun numactl -s
policy: default
preferred node: current
physcpubind: 0
cpubind: 0
nodebind: 0
membind: 0 1
Fair Use of Resources
Always set a time limit when using salloc to run jobs interactively, and cancel the job using scancel <JOB_ID> when you have finished (replace <JOB_ID> with the job ID of the allocation).
SQUEUE, SACCT, SCANCEL: Managing jobs¶
The squeue command shows the jobs running on the system; combine it with the --me flag to see only your own jobs that are currently queued or running:
$ squeue --me
JOBID USER PARTITION NAME ST TIME_LIMIT TIME TIME_LEFT NODES NODELIST(REASON)
19132 user.project workq interactive R 2:00 0:53 1:07 1 nid001038
The sacct command shows your current and previously completed jobs:
$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ------------ ---------- ------------ --------
19107 nvtest workq project 144 COMPLETED 0:0
19107.batch batch project 144 COMPLETED 0:0
19107.0 nvidia-smi project 144 FAILED 2:0
19114 pysrun workq project 72 COMPLETED 0:0
19114.batch batch project 72 COMPLETED 0:0
19114.0 python3 project 1 COMPLETED 0:0
19114.1 python3 project 1 COMPLETED 0:0
19114.2 python3 project 1 COMPLETED 0:0
19130 interacti+ workq project 72 CANCELLED 0:0
19132 pysrun2 workq project 72 RUNNING 0:0
19132.batch batch project 72 RUNNING 0:0
19132.0 python3 project 1 RUNNING 0:0
scancel is used to cancel a job, e.g. to cancel the job with ID 19132:
$ scancel 19132