Archived issues¶
Note
The issues below are resolved. If you still encounter them, please contact the BriCS team.
BriCS Services and Facilities¶
GPU specific environment variables¶
Last updated: 2026-01-08 (Archived)
- Services affected
-
Isambard 3 MACS, Isambard-AI Phase 1, Isambard-AI Phase 2
- Description
-
Slurm is configured to set up access to GPUs but has not been told the exact type of GPU (NVIDIA, AMD, Intel etc.) and therefore sets many GPU-specific environment variables for all types, such as
ROCR_VISIBLE_DEVICES, ZE_AFFINITY_MASK, GPU_DEVICE_ORDINAL. If your code depends on these variables to detect the type of GPU it may get confused.
- Workaround
-
Unsetting the variables before running your code should work. This will be fixed in a future update.
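A minimal sketch of this workaround, for use in the shell that launches your code. The function name unset_foreign_gpu_vars is hypothetical, and the list covers only the three variables named in the description, so it may not be exhaustive. Where exactly the variables need to be unset (the batch script or a wrapper around your program) is an assumption to check against your own job.

```shell
# Hypothetical helper: unset the GPU-vendor variables named in the description.
# Only the three variables listed above are covered; Slurm may set others.
unset_foreign_gpu_vars() {
    for gpu_var in ROCR_VISIBLE_DEVICES ZE_AFFINITY_MASK GPU_DEVICE_ORDINAL; do
        unset "$gpu_var"
    done
}

# Call it before starting your application in the same shell.
unset_foreign_gpu_vars
```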
PyTorch Distributed Tutorial Errors¶
Last updated: 2025-10-20 (Archived)
- Services affected
-
Isambard-AI Phase 1
- Description
-
The Distributed PyTorch Training tutorial is currently experiencing compatibility issues on Isambard-AI.
- Workaround
-
No workaround available.
We are actively working to resolve these problems. Please check back for updates.
Access issues to Isambard-AI Phase 1¶
Last updated: 2025-08-27 (Archived)
- Services affected
-
Isambard-AI Phase 1
- Description
-
Isambard-AI Phase 1 is currently inaccessible and the cause is under investigation.
- Workaround
-
No workaround available.
Multi-node Podman-HPC error: "Permission denied"¶
Last updated: 2025-08-07
- Services affected
-
Isambard-AI Phase 1, Isambard 3
- Description
-
Podman-HPC can sometimes get into a bad configuration when used for multi-node workloads (MPI, NCCL), resulting in errors of the form
Permission denied: '/local/user/<user-id>/storage/overlay/<HASH>'. This can occur due to issues with user namespace mapping, used by podman-hpc to allow containers to run rootless. See NERSC/podman-hpc issue #116.
- Workaround
-
Use podman-hpc unshare to enter a user namespace, then delete the directory $LOCALDIR/storage/overlay/<HASH> (with <HASH> as shown in the error message), and the files $LOCALDIR/storage/overlay-images/images.json and $LOCALDIR/storage/overlay-layers/layers.json. If the image has been migrated then the corresponding directories and files under $SCRATCHDIR will also need to be deleted. If Permission denied or FileNotFoundError issues continue, follow the steps in the podman-hpc Troubleshooting guide to clear stored data and reset podman-hpc.
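The deletion steps above could be scripted roughly as follows. This is a sketch, not an official BriCS procedure: the function name clean_podman_hpc_overlay is hypothetical, and it assumes that podman-hpc unshare accepts a command to run inside the user namespace, as podman unshare does. Pass the <HASH> from your error message and the storage root ($LOCALDIR, or $SCRATCHDIR for migrated images).

```shell
# Hypothetical helper: remove one broken overlay directory and the two
# index files named in the workaround, inside the podman-hpc user namespace.
#   $1 = <HASH> copied from the "Permission denied" error message
#   $2 = storage root, e.g. "$LOCALDIR" (or "$SCRATCHDIR" for migrated images)
clean_podman_hpc_overlay() {
    overlay_hash="$1"
    storage_root="$2"

    # Refuse to run with missing arguments, so an empty value can never
    # widen the deletion to the whole overlay directory.
    if [ -z "$overlay_hash" ] || [ -z "$storage_root" ]; then
        echo "usage: clean_podman_hpc_overlay <HASH> <storage-root>" >&2
        return 1
    fi

    # Assumption: "podman-hpc unshare <command>" runs <command> inside the
    # user namespace, mirroring "podman unshare <command>".
    podman-hpc unshare rm -rf "$storage_root/storage/overlay/$overlay_hash" &&
    podman-hpc unshare rm -f \
        "$storage_root/storage/overlay-images/images.json" \
        "$storage_root/storage/overlay-layers/layers.json"
}

# Example call (replace <HASH> with the value from your error message):
#   clean_podman_hpc_overlay "<HASH>" "$LOCALDIR"
```

The paths are expanded by the calling shell before the command enters the namespace, so $LOCALDIR and $SCRATCHDIR only need to be set in your login or job environment.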
Isambard 3 MACS "hopper" partition in extended maintenance¶
Last updated: 2025-06-27 (Archived)
- Description
-
The "hopper" partition in the MACS cluster is undergoing extended maintenance due to a hardware fault with the node. It is awaiting diagnosis and parts before returning to service.
- Workaround
-
If possible, please use an alternative GPU partition.
Third Party¶
Singularity internal error when pulling images from nvcr.io¶
Last updated: 2025-06-25 (Archived)
- Description
-
There is a known issue with Go (a Singularity dependency) when downloading large images from the NVIDIA Container Registry, nvcr.io. When the image pull fails, the following error is thrown:
stream error: stream ID <NUM>; INTERNAL_ERROR; received from peer
- Workaround
-
Temporarily disable HTTP/2 in Go's HTTP client using
export GODEBUG=http2client=0