Job scheduling
Overview
The scheduling and allocation of resources to compute jobs on BriCS compute services (Isambard-AI Phase 1, Isambard-AI Phase 2, Isambard 3, etc.) are managed by the Slurm workload manager.
The configuration of the workload manager controls how jobs are scheduled, how resources are shared, and how resource limits are imposed. This configuration is tailored to each BriCS compute service based on the compute resources offered, the expected usage profile of the service, and the principles described in the Resource Management Model.
Key information about the configuration of the workload manager for each compute service is summarised below. For information on how to submit and manage jobs, see the Slurm Job Management guide.
User and project limits
The following resource limits are effective at the user and project level.
Why do these limits exist?
Per-user and per-project limits may be imposed to prevent any single user or project from monopolising the scheduler queue, which would block others from running jobs. Any such limits are reviewed regularly and may be adjusted as the system evolves and usage patterns change. Tighter limits may occasionally be imposed temporarily to protect the integrity of the service. Check the service status page for current restrictions.
| Resource Limit | Value | Applies to | QoS name | Notes |
|---|---|---|---|---|
| Max GPUs allocated | 32 | Project | 32gpu_qos | Maximum gres/gpu resource allocated to all jobs associated with a project |
| Resource Limit | Value | Applies to | QoS name | Notes |
|---|---|---|---|---|
| None | | | | |
Limits under review
Isambard 3 user and project resource limits have been temporarily relaxed while they are under review.
| Resource Limit | Value | Applies to | QoS name | Notes |
|---|---|---|---|---|
| Max queued jobs | 1000 | User | grace_qos | Maximum number of jobs pending or running for a user |
| Resource Limit | Value | Applies to | QoS name | Notes |
|---|---|---|---|---|
| Max GPUs allocated | 2 | Project | macs_qos | Maximum gres/gpu resource allocated to all jobs associated with a project |
| Max queued jobs | 20 | Project | macs_qos | Maximum number of jobs pending or running associated with a project |
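As a rough illustration of how the project-level limits in the tables above combine with job requests, the following Python sketch checks whether a new request would stay within a limit. The helper name `fits_within_limit` and the figures for resources already in use are hypothetical; the limit values (32 GPUs under `32gpu_qos`, 20 queued jobs under `macs_qos`) are taken from the tables. It models only the arithmetic and makes no claim about how Slurm itself evaluates or reports these limits.

```python
def fits_within_limit(requested: int, in_use: int, limit: int) -> bool:
    """Return True if adding `requested` to `in_use` stays within `limit`."""
    if requested < 0 or in_use < 0:
        raise ValueError("resource counts must be non-negative")
    return in_use + requested <= limit


# Limit values copied from the tables above.
MAX_GPUS_32GPU_QOS = 32    # Max GPUs allocated per project (32gpu_qos)
MAX_QUEUED_MACS_QOS = 20   # Max queued jobs per project (macs_qos)

# Hypothetical situation: the project's running jobs already hold 24 GPUs.
print(fits_within_limit(requested=8, in_use=24, limit=MAX_GPUS_32GPU_QOS))   # True
print(fits_within_limit(requested=12, in_use=24, limit=MAX_GPUS_32GPU_QOS))  # False
```

The same check applies to the queued-job limits by passing job counts instead of GPU counts.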
Partition configuration
| Partition name | User accessible | QoS name | Nodes | Maximum walltime | Notes |
|---|---|---|---|---|---|
| workq (*) | Yes | workq_qos | 38 × 4 GH200 node | 24h | For general purpose AI/ML workloads |
| bricsonly | No | N/A | 2 × 4 GH200 node | N/A | Test partition for BriCS administrators |
| Partition name | User accessible | QoS name | Nodes | Maximum walltime | Notes |
|---|---|---|---|---|---|
| workq (*) | Yes | workq_qos | 1320 × 4 GH200 node | 24h | For general purpose AI/ML workloads |
| Partition name | User accessible | QoS name | Nodes | Maximum walltime | Notes |
|---|---|---|---|---|---|
| grace (*) | Yes | grace_qos | 384 × 2 Grace CPU superchip node | 24h | For general purpose CPU workloads |
| Partition name | User accessible | QoS name | Nodes | Maximum walltime |
|---|---|---|---|---|
| milan | Yes | macs_qos | 12 × AMD Milan CPU node | 24h |
| genoa | Yes | macs_qos | 2 × AMD Genoa CPU node | 24h |
| berg | Yes | macs_qos | 2 × AMD Bergamo CPU node | 24h |
| spr | Yes | macs_qos | 2 × Intel Sapphire Rapids CPU node | 24h |
| sprhbm | Yes | macs_qos | 2 × Intel Sapphire Rapids CPU node (HBM) | 24h |
| ampere | Yes | macs_qos | 2 × AMD Milan CPU + 4 A100 GPU node | 24h |
| hopper | Yes | macs_qos | 1 × AMD Milan CPU + 4 H100 GPU node | 24h |
| instinct | Yes | macs_qos | 2 × AMD Milan CPU + 4 MI100 GPU node | 24h |
Multi-Architecture Comparison System (MACS)
The Multi-Architecture Comparison System (MACS) comprises small numbers of compute nodes of varying architectures. It is not suitable for production workloads; it is intended for researching, evaluating, and comparing different node architectures.
(*) Denotes the default partition that jobs are submitted to if no partition is specified.
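A job script selects a partition and a walltime through `#SBATCH` directives. As a minimal sketch, the hypothetical Python helper below builds the `--partition` and `--time` directives and rejects a walltime above the 24h maximum listed in the tables above. The function name and the choice to express walltime as hours and minutes are illustrative assumptions; only the partition names and the 24h limit come from this page, and each partition exists only on its own compute service.

```python
# Maximum walltime in hours, copied from the partition tables above.
# Every user-facing partition listed on this page shows 24h.
MAX_WALLTIME_HOURS = {
    "workq": 24, "grace": 24, "milan": 24, "genoa": 24, "berg": 24,
    "spr": 24, "sprhbm": 24, "ampere": 24, "hopper": 24, "instinct": 24,
}


def sbatch_directives(partition: str, hours: int, minutes: int = 0) -> list:
    """Build #SBATCH lines for a partition and walltime (hypothetical helper)."""
    if partition not in MAX_WALLTIME_HOURS:
        raise ValueError(f"unknown partition: {partition}")
    if hours < 0 or not 0 <= minutes < 60:
        raise ValueError("walltime must use hours >= 0 and 0 <= minutes < 60")
    limit_minutes = MAX_WALLTIME_HOURS[partition] * 60
    if hours * 60 + minutes > limit_minutes:
        raise ValueError(f"walltime exceeds the {MAX_WALLTIME_HOURS[partition]}h maximum")
    return [
        f"#SBATCH --partition={partition}",
        f"#SBATCH --time={hours:02d}:{minutes:02d}:00",  # hours:minutes:seconds
    ]


print("\n".join(sbatch_directives("grace", 12, 30)))
```

Omitting the `--partition` line submits the job to the default partition marked with (*).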
Why is the maximum walltime 24 hours?
The 24-hour limit helps ensure fair scheduling across all users. Jobs that run for longer periods hold nodes that cannot be reassigned to other work, making it harder for the scheduler to backfill shorter jobs and increasing queue times for everyone. The limit also allows the operations team to apply system maintenance and security patches regularly without needing to forcibly terminate long-running jobs. It also protects your work: jobs running for days without checkpointing are vulnerable to any system interruption.
Workloads requiring longer than 24 hours should implement checkpointing to enable jobs to resume from where they left off. See the Slurm guide for information on job dependencies and resubmission.
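The checkpoint-and-resume pattern can be sketched as follows. This is a minimal, generic Python illustration rather than BriCS-specific guidance: the function names, the toy workload (summing integers), the JSON checkpoint file, and the time-budget mechanism are all assumptions chosen for the example. A real application would save its own state (for instance model weights or simulation fields) and would choose a budget comfortably below the 24h walltime.

```python
import json
import os
import tempfile
import time


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write state atomically: write a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, n_steps, budget_seconds, checkpoint_every=100):
    """Advance the toy workload until it finishes or the time budget is spent.

    Returns True when all steps are complete, or False when the job stopped
    early and should be resubmitted to continue from the checkpoint.
    """
    state = load_checkpoint(path)
    deadline = time.monotonic() + budget_seconds
    while state["step"] < n_steps:
        if time.monotonic() >= deadline:
            save_checkpoint(path, state)
            return False
        state["total"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return True
```

When `run` returns `False`, the job can exit cleanly and be resubmitted; the next run loads the checkpoint and continues from the saved step instead of starting over.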