
Job scheduling

Overview

The scheduling and allocation of resources to compute jobs on BriCS compute services (Isambard-AI Phase 1, Isambard 3, etc.) are managed by the Slurm workload manager.

The configuration of the workload manager controls how jobs are scheduled, how resources are shared, and how resource limits are imposed. This configuration is tailored to each BriCS compute service based on the compute resources offered, the expected usage profile of the service, and the principles described in the Resource Management Model.

Key information about the configuration of the workload manager for each compute service is summarized below. For information on how to submit and manage jobs, see the Slurm Job Management guide.
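
The tables below reflect the configuration at the time of writing; because values can change, the live configuration can always be queried directly from a login node with standard Slurm commands. A minimal sketch (the partition name `workq` is taken from the tables below):

```bash
# List partitions with availability, node counts, time limits and GRES
sinfo --format="%P %a %D %l %G"

# Show the full Slurm configuration of a single partition, e.g. the default partition
scontrol show partition workq
```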

User and project limits

The following resource limits are enforced at the user and project levels.

Isambard-AI Phase 1:

| Resource limit | Value | Applies to | QoS name | Notes |
|---|---|---|---|---|
| Max GPUs allocated | 32 | Project | 32gpu_qos | Maximum gres/gpu resource allocated to all jobs associated with a project |

Isambard 3:

| Resource limit | Value | Applies to | QoS name | Notes |
|---|---|---|---|---|
| Max nodes allocated | 64 | User | grace_qos | Maximum number of compute nodes allocated to a user's jobs |
| Max queued jobs | 50 | User | grace_qos | Maximum number of jobs pending or running for a user |
| Max nodes allocated | 128 | Project | grace_qos | Maximum number of compute nodes allocated to all jobs associated with a project |
| Max queued jobs | 200 | Project | grace_qos | Maximum number of jobs pending or running associated with a project |

Multi-Architecture Comparison System (MACS):

| Resource limit | Value | Applies to | QoS name | Notes |
|---|---|---|---|---|
| Max GPUs allocated | 2 | Project | macs_qos | Maximum gres/gpu resource allocated to all jobs associated with a project |
| Max queued jobs | 20 | Project | macs_qos | Maximum number of jobs pending or running associated with a project |
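
The live QoS limits can be inspected with `sacctmgr`. The sketch below assumes the per-user and per-project limits above are expressed as per-user (PU) and per-account (PA) fields on the QoS, which is an assumption about how the service maps its limits onto Slurm:

```bash
# Show per-user (PU) and per-account/project (PA) limits attached to each QoS
sacctmgr show qos format=Name%20,MaxTRESPU%30,MaxSubmitPU,MaxTRESPA%30,MaxSubmitPA

# Count your own pending and running jobs against the "max queued jobs" limit
squeue -u "$USER" --states=PENDING,RUNNING --noheader | wc -l
```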

Partition configuration

Isambard-AI Phase 1:

| Partition name | User-accessible QoS name | Nodes | Maximum walltime | Notes |
|---|---|---|---|---|
| workq (*) | workq_qos | 38 × 4 GH200 nodes | 24h | For general purpose AI/ML workloads |
| bricsonly | N/A | 1 × 4 GH200 node | N/A | Test partition for BriCS administrators |
| interactive | N/A | 1 × 4 GH200 node | N/A | Test partition for BriCS administrators |

Isambard 3:

| Partition name | User-accessible QoS name | Nodes | Maximum walltime | Notes |
|---|---|---|---|---|
| grace (*) | grace_qos | 384 × 2 Grace CPU superchip nodes | 24h | For general purpose CPU workloads |

Multi-Architecture Comparison System (MACS):

| Partition name | User-accessible QoS name | Nodes | Maximum walltime |
|---|---|---|---|
| milan | macs_qos | 12 × AMD Milan CPU node | 24h |
| genoa | macs_qos | 2 × AMD Genoa CPU node | 24h |
| berg | macs_qos | 2 × AMD Bergamo CPU node | 24h |
| spr | macs_qos | 2 × Intel Sapphire Rapids CPU node | 24h |
| sprhbm | macs_qos | 2 × Intel Sapphire Rapids CPU node (HBM) | 24h |
| ampere | macs_qos | 2 × AMD Milan CPU + 4 A100 GPU node | 24h |
| hopper | macs_qos | 1 × AMD Milan CPU + 4 H100 GPU node | 24h |
| instinct | macs_qos | 2 × AMD Milan CPU + 4 MI100 GPU node | 24h |
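
As a concrete example, a batch script requesting a single GH200 node on the default workq partition might look like the sketch below. The account name `my_project` and the executable are placeholders, and the exact GRES syntax and whether the QoS must be requested explicitly should be checked against the service documentation:

```bash
#!/bin/bash
#SBATCH --job-name=gh200-example
#SBATCH --partition=workq          # default partition on Isambard-AI Phase 1
#SBATCH --qos=workq_qos
#SBATCH --account=my_project       # placeholder: replace with your project account
#SBATCH --nodes=1
#SBATCH --gres=gpu:4               # all four GH200 GPUs on the node (assumed GRES name)
#SBATCH --time=02:00:00            # must be within the 24h partition walltime limit

srun ./my_application              # placeholder executable
```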

Multi-Architecture Comparison System (MACS)

The Multi-Architecture Comparison System (MACS) comprises small numbers of compute nodes of varying architectures. The system is not suitable for production workloads; it is intended for researching, evaluating, and comparing different node architectures.
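
Because MACS is intended for cross-architecture comparison, a common pattern is to submit the same job script to several MACS partitions and compare the results. A hedged sketch, where `compare.sh` is a placeholder benchmark script:

```bash
# Submit the same benchmark to two different architectures for comparison
for part in milan spr; do
    sbatch --partition="$part" --qos=macs_qos --nodes=1 --time=01:00:00 compare.sh
done
```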

(*) Denotes the default partition that jobs are submitted to if no partition is specified.