# Job scheduling

## Overview
The scheduling and allocation of resources to compute jobs on BriCS compute services (Isambard-AI Phase 1, Isambard 3, etc.) are managed by the Slurm workload manager.
The configuration of the workload manager controls how jobs are scheduled, how resources are shared, and how resource limits are imposed. This configuration is tailored to each BriCS compute service based on the compute resources offered, the expected usage profile of the service, and the principles described in the Resource Management Model.
Key information about the configuration of the workload manager for each compute service is summarized below. For information on how to submit and manage jobs, see the Slurm Job Management guide.
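For orientation, a minimal batch script for the default Isambard-AI Phase 1 partition might look like the sketch below; the account name, resource requests, and executable are placeholders, and the Slurm Job Management guide remains the reference for service-specific submission details.

```bash
#!/bin/bash
# Illustrative sketch only: the account, resource requests, and executable
# are placeholders and should be adjusted for the service you are using.
#SBATCH --job-name=example
#SBATCH --account=my-project   # placeholder project account name
#SBATCH --partition=workq      # default partition on Isambard-AI Phase 1
#SBATCH --gpus=4               # GPU request, counted against the QoS limits below
#SBATCH --time=01:00:00        # walltime; must not exceed the partition maximum

srun ./my_program              # placeholder executable
```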
## User and project limits

The following resource limits apply at the user and project level.
**Isambard-AI Phase 1**

| Resource Limit | Value | Applies to | QoS name | Notes |
| --- | --- | --- | --- | --- |
| Max GPUs allocated | 32 | Project | 32gpu_qos | Maximum gres/gpu resource allocated to all jobs associated with a project |
**Isambard 3**

| Resource Limit | Value | Applies to | QoS name | Notes |
| --- | --- | --- | --- | --- |
| Max nodes allocated | 64 | User | grace_qos | Maximum number of compute nodes allocated to a user's jobs |
| Max queued jobs | 50 | User | grace_qos | Maximum number of jobs pending or running for a user |
| Max nodes allocated | 128 | Project | grace_qos | Maximum number of compute nodes allocated to all jobs associated with a project |
| Max queued jobs | 200 | Project | grace_qos | Maximum number of jobs pending or running associated with a project |
**MACS**

| Resource Limit | Value | Applies to | QoS name | Notes |
| --- | --- | --- | --- | --- |
| Max GPUs allocated | 2 | Project | macs_qos | Maximum gres/gpu resource allocated to all jobs associated with a project |
| Max queued jobs | 20 | Project | macs_qos | Maximum number of jobs pending or running associated with a project |
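These limits are implemented as Slurm QoS settings, so they can be inspected on the relevant service with standard Slurm accounting commands. The sketch below assumes `sacctmgr` is available on the login nodes; the exact columns and values reported will depend on the service you are logged in to.

```bash
# List QoS definitions with their per-user and per-account (project) limits.
sacctmgr show qos format=Name,MaxTRESPerUser,MaxTRESPerAccount,MaxJobsPerUser,MaxSubmitJobsPerUser

# Show the account (project), partition, and QoS associated with your user.
sacctmgr show associations user=$USER format=Account,User,Partition,QOS
```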
## Partition configuration

**Isambard-AI Phase 1**

| Partition name | User accessible | QoS name | Nodes | Maximum walltime | Notes |
| --- | --- | --- | --- | --- | --- |
| workq (*) | Yes | workq_qos | 38 × 4 GH200 node | 24h | For general purpose AI/ML workloads |
| bricsonly | No | N/A | 1 × 4 GH200 node | N/A | Test partition for BriCS administrators |
| interactive | No | N/A | 1 × 4 GH200 node | N/A | Test partition for BriCS administrators |
**Isambard 3**

| Partition name | User accessible | QoS name | Nodes | Maximum walltime | Notes |
| --- | --- | --- | --- | --- | --- |
| grace (*) | Yes | grace_qos | 384 × 2 Grace CPU superchip node | 24h | For general purpose CPU workloads |
**MACS**

| Partition name | User accessible | QoS name | Nodes | Maximum walltime |
| --- | --- | --- | --- | --- |
| milan | Yes | macs_qos | 12 × AMD Milan CPU node | 24h |
| genoa | Yes | macs_qos | 2 × AMD Genoa CPU node | 24h |
| berg | Yes | macs_qos | 2 × AMD Bergamo CPU node | 24h |
| spr | Yes | macs_qos | 2 × Intel Sapphire Rapids CPU node | 24h |
| sprhbm | Yes | macs_qos | 2 × Intel Sapphire Rapids CPU node (HBM) | 24h |
| ampere | Yes | macs_qos | 2 × AMD Milan CPU + 4 A100 GPU node | 24h |
| hopper | Yes | macs_qos | 1 × AMD Milan CPU + 4 H100 GPU node | 24h |
| instinct | Yes | macs_qos | 2 × AMD Milan CPU + 4 MI100 GPU node | 24h |
**Multi-Architecture Comparison System (MACS)**

The Multi-Architecture Comparison System (MACS) comprises small numbers of compute nodes of varying architectures. The system is not suitable for production workloads and is intended for researching, evaluating, and comparing different node architectures.
(*) Denotes the default partition that jobs are submitted to if no partition is specified.
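The partition details above can be confirmed on each service with standard Slurm commands; a brief sketch follows, where the partition name is only an example and should be replaced with the one relevant to your service.

```bash
# Summarise the partitions visible to you, with node counts, states,
# and time limits.
sinfo --summarize

# Show the full configuration of a single partition, e.g. the default
# Isambard 3 partition.
scontrol show partition grace
```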