# Job scheduling

## Overview
The scheduling and allocation of resources to compute jobs on BriCS compute services (Isambard-AI Phase 1, Isambard 3, etc.) are managed by the Slurm workload manager.
The configuration of the workload manager controls how jobs are scheduled, how resources are shared, and how resource limits are imposed. This configuration is tailored to each BriCS compute service based on the compute resources offered, the expected usage profile of the service, and the principles described in the Resource Management Model.
Key information about the configuration of the workload manager for each compute service is summarized below. For information on how to submit and manage jobs, see the Slurm Job Management guide.
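For orientation, a minimal batch script for the default Isambard-AI Phase 1 partition might look like the sketch below; the account name, resource requests, and executable are placeholders, and the Slurm Job Management guide remains the reference for service-specific submission details.

```bash
#!/bin/bash
# Illustrative sketch only: the account, resource requests, and executable
# are placeholders and should be adjusted for the service you are using.
#SBATCH --job-name=example
#SBATCH --account=my-project   # placeholder project account name
#SBATCH --partition=workq      # default partition on Isambard-AI Phase 1
#SBATCH --gpus=4               # GPU request, counted against the QoS limits below
#SBATCH --time=01:00:00        # walltime; must not exceed the partition maximum

srun ./my_program              # placeholder executable
```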
## User and project limits

The following resource limits apply at the user and project level.
**Isambard-AI Phase 1**

| Resource Limit | Value | Applies to | QoS name | Notes |
| --- | --- | --- | --- | --- |
| Max GPUs allocated | 32 | Project | 32gpu_qos | Maximum gres/gpu resource allocated to all jobs associated with a project |
**Isambard 3**

| Resource Limit | Value | Applies to | QoS name | Notes |
| --- | --- | --- | --- | --- |
| Max nodes allocated | 64 | User | grace_qos | Maximum number of compute nodes allocated to a user's jobs |
| Max queued jobs | 50 | User | grace_qos | Maximum number of jobs pending or running for a user |
| Max nodes allocated | 128 | Project | grace_qos | Maximum number of compute nodes allocated to all jobs associated with a project |
| Max queued jobs | 200 | Project | grace_qos | Maximum number of jobs pending or running associated with a project |
**MACS**

| Resource Limit | Value | Applies to | QoS name | Notes |
| --- | --- | --- | --- | --- |
| Max GPUs allocated | 2 | Project | macs_qos | Maximum gres/gpu resource allocated to all jobs associated with a project |
| Max queued jobs | 20 | Project | macs_qos | Maximum number of jobs pending or running associated with a project |
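These limits are implemented as Slurm QoS settings, so they can be inspected on the relevant service with standard Slurm accounting commands. The sketch below assumes `sacctmgr` is available on the login nodes; the exact columns and values reported will depend on the service you are logged in to.

```bash
# List QoS definitions with their per-user and per-account (project) limits.
sacctmgr show qos format=Name,MaxTRESPerUser,MaxTRESPerAccount,MaxJobsPerUser,MaxSubmitJobsPerUser

# Show the account (project), partition, and QoS associated with your user.
sacctmgr show associations user=$USER format=Account,User,Partition,QOS
```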
## Partition configuration

**Isambard-AI Phase 1**

| Partition name | User accessible | QoS name | Nodes | Maximum walltime | Notes |
| --- | --- | --- | --- | --- | --- |
| workq (*) | Yes | workq_qos | 38 × 4 GH200 node | 24h | For general purpose AI/ML workloads |
| bricsonly | No | N/A | 1 × 4 GH200 node | N/A | Test partition for BriCS administrators |
| interactive | No | N/A | 1 × 4 GH200 node | N/A | Test partition for BriCS administrators |
**Isambard 3**

| Partition name | User accessible | QoS name | Nodes | Maximum walltime | Notes |
| --- | --- | --- | --- | --- | --- |
| grace (*) | Yes | grace_qos | 384 × 2 Grace CPU superchip node | 24h | For general purpose CPU workloads |
**MACS**

| Partition name | User accessible | QoS name | Nodes | Maximum walltime |
| --- | --- | --- | --- | --- |
| milan | Yes | macs_qos | 12 × AMD Milan CPU node | 24h |
| genoa | Yes | macs_qos | 2 × AMD Genoa CPU node | 24h |
| berg | Yes | macs_qos | 2 × AMD Bergamo CPU node | 24h |
| spr | Yes | macs_qos | 2 × Intel Sapphire Rapids CPU node | 24h |
| sprhbm | Yes | macs_qos | 2 × Intel Sapphire Rapids CPU node (HBM) | 24h |
| ampere | Yes | macs_qos | 2 × AMD Milan CPU + 4 A100 GPU node | 24h |
| hopper | Yes | macs_qos | 1 × AMD Milan CPU + 4 H100 GPU node | 24h |
| instinct | Yes | macs_qos | 2 × AMD Milan CPU + 4 MI100 GPU node | 24h |
**Multi-Architecture Comparison System (MACS)**

The Multi-Architecture Comparison System (MACS) comprises small numbers of compute nodes of varying architectures. The system is not suitable for production workloads and is intended for researching, evaluating, and comparing different node architectures.
(*) Denotes the default partition that jobs are submitted to if no partition is specified.
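The partition details above can be confirmed on each service with standard Slurm commands; a brief sketch follows, where the partition name is only an example and should be replaced with the one relevant to your service.

```bash
# Summarise the partitions visible to you, with node counts, states,
# and time limits.
sinfo --summarize

# Show the full configuration of a single partition, e.g. the default
# Isambard 3 partition.
scontrol show partition grace
```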