Archived issues¶
Note
The issues below are resolved. If you still encounter them, please get in touch with the BriCS team.
BriCS Services and Facilities¶
PyTorch Distributed Tutorial Errors¶
Last updated: 2025-10-20 (Archived)
- Services affected
-
Isambard-AI Phase 1
- Description
-
The Distributed PyTorch Training tutorial is currently experiencing compatibility issues on Isambard-AI.
- Workaround
-
No workaround available.
We are actively working to resolve these problems. Please check back for updates.
Access issues to Isambard-AI Phase 1¶
Last updated: 2025-08-27 (Archived)
- Services affected
-
Isambard-AI Phase 1
- Description
-
Isambard-AI Phase 1 is currently inaccessible and the cause is under investigation.
- Workaround
-
No workaround available.
Multi-node Podman-HPC error: "Permission denied"¶
Last updated: 2025-08-07
- Services affected
-
Isambard-AI Phase 1, Isambard 3
- Description
-
Podman-HPC can sometimes get into a bad configuration when used for multi-node workloads (MPI, NCCL), resulting in errors of the form
Permission denied: '/local/user/<user-id>/storage/overlay/<HASH>'
This can occur due to issues with user namespace mapping, which podman-hpc uses to allow containers to run rootless. See NERSC/podman-hpc issue #116.
- Workaround
-
Use podman-hpc unshare to enter a user namespace, then delete the directory $LOCALDIR/storage/overlay/<HASH> (with <HASH> as shown in the error message), and the files $LOCALDIR/storage/overlay-images/images.json and $LOCALDIR/storage/overlay-layers/layers.json. If the image has been migrated, the corresponding directories and files under $SCRATCHDIR will also need to be deleted. If Permission denied or FileNotFoundError issues continue, follow the steps in the podman-hpc Troubleshooting guide to clear stored data and reset podman-hpc.
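The paths named in the workaround above can be listed before anything is deleted. The following is a minimal sketch only: the helper name podman_hpc_stale_paths and the print-first approach are illustrative assumptions and are not part of podman-hpc. It prints the three paths for a given storage root and overlay hash so they can be reviewed.

```shell
# Hypothetical helper (not part of podman-hpc): print the paths that the
# workaround says to delete, for a given storage root and overlay hash.
podman_hpc_stale_paths() {
    root="$1"   # storage root, e.g. "$LOCALDIR"
    hash="$2"   # the <HASH> shown in the "Permission denied" error
    if [ -z "$root" ] || [ -z "$hash" ]; then
        echo "usage: podman_hpc_stale_paths <storage-root> <HASH>" >&2
        return 1
    fi
    # One path per line: the overlay directory, then the two JSON index files.
    printf '%s\n' \
        "$root/storage/overlay/$hash" \
        "$root/storage/overlay-images/images.json" \
        "$root/storage/overlay-layers/layers.json"
}

# Review the list first, for example:
#   podman_hpc_stale_paths "$LOCALDIR" "<HASH>"
# then remove those paths from inside the user namespace entered with
# podman-hpc unshare, as described in the workaround.
```

Printing rather than deleting keeps the sketch safe to run anywhere; the actual removal still needs to happen inside the user namespace, where the files are owned by the mapped user.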
Isambard 3 MACS "hopper" partition in extended maintenance¶
Last updated: 2025-06-27 (Archived)
- Description
-
The "hopper" partition in the MACS cluster is undergoing extended maintenance due to a hardware fault with the node. It is awaiting diagnosis and parts before it can return to service.
- Workaround
-
If possible, please use an alternative GPU partition.
Third Party¶
Singularity internal error when pulling images from nvcr.io¶
Last updated: 2025-06-25 (Archived)
- Description
-
There is a known issue with Go (a Singularity dependency) when downloading large images from the NVIDIA Container Registry (nvcr.io). When the pull fails, the following error is thrown:
stream error: stream ID <NUM>; INTERNAL_ERROR; received from peer
- Workaround
-
Temporarily disable HTTP/2 using
export GODEBUG=http2client=0
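Because the setting is meant to be temporary, it can also be scoped to a single command instead of being exported for the whole session. The sketch below assumes a hypothetical wrapper name, without_http2, and the image reference in the usage comment is a placeholder.

```shell
# Hypothetical wrapper: run one command with Go's HTTP/2 client disabled,
# leaving GODEBUG untouched for the rest of the shell session.
# Note: this replaces any existing GODEBUG value for that one command.
without_http2() {
    env GODEBUG=http2client=0 "$@"
}

# Usage (image reference is a placeholder):
#   without_http2 singularity pull docker://nvcr.io/<image>:<tag>
```

Using env keeps the variable out of the calling shell, so later commands run with their normal HTTP settings.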