Singularity Multi-node
To run Singularity containers across multiple nodes, you will need to load the following module:
$ module load brics/singularity-multi-node
This module populates the `SINGULARITY_BINDPATH` environment variable to bind the required host dependencies into the container. Inside your container, you must source `/host/adapt.sh` for the changes to take effect. Here is an example of how to do this with an Ubuntu container:
$ singularity pull docker://ubuntu:latest
$ singularity run --nv ubuntu_latest.sif
Singularity> source /host/adapt.sh
This container has been adapted to run MPI and NCCL applications on Isambard-AI.
The LD_LIBRARY_PATH and PATH have been set to include the OpenMPI and NCCL libraries.
Singularity>
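You can quickly confirm that the adaptation has taken effect by checking that `/host` paths now appear in the container's environment (the exact entries depend on the system configuration):
Singularity> echo $PATH | tr ':' '\n' | grep host
Singularity> echo $LD_LIBRARY_PATH | tr ':' '\n' | grep host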
Existing runscripts
If your container already has a runscript/entrypoint, it can be executed as follows:
Singularity> /.singularity.d/runscript
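As shown in the later examples, `/host/adapt.sh` also accepts a command to run after setting up the environment, so the runscript can be launched non-interactively from the host as well (a sketch; replace `my_container.sif` with your image):
$ singularity exec --nv my_container.sif /host/adapt.sh /.singularity.d/runscript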
Using PMIX
The current OpenMPI on the system is built only with PMIX support. When you execute jobs with containers, you must use `srun --mpi=pmix`. Please see the MPI documentation.
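You can check that the pmix plugin is available to srun before launching:
$ srun --mpi=list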
Example: `mpicc` and `ompi_info`
The MPI C compiler wrapper (`mpicc`) is configured inside a multi-node container. Next, we will check which `mpicc` is being used and verify the installation prefix. We show how to do this using the `singularity exec` command:
$ singularity exec --nv ubuntu_latest.sif /host/adapt.sh
This container has been adapted to run MPI and NCCL applications on Isambard-AI.
The LD_LIBRARY_PATH and PATH have been set to include the OpenMPI and NCCL libraries.
Singularity> which mpicc
/host/ompi/bin/mpicc
Singularity> mpicc -show
/usr/bin/gcc -I/host/ompi/include -pthread -L/host/ompi/lib -Wl,-rpath -Wl,/host/ompi/lib -Wl,--enable-new-dtags -lmpi
Singularity> ompi_info
...
This verifies that `mpicc` correctly includes the header files and links against the appropriate libraries. `ompi_info` can also show you more about how OpenMPI has been built.
Note that you will have to include your own C compiler (e.g. `gcc`) or use a container with a compiler suite built in.
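As an end-to-end check of the wrapper, you can compile and run a minimal MPI program against the host OpenMPI. This is only a sketch: `hello.c` is a file you create yourself, and `my_container.sif` is a placeholder for an image that ships a C compiler (for example the NGC PyTorch image used in the next example, rather than the plain Ubuntu image above):
$ cat > hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
$ singularity exec --nv my_container.sif /host/adapt.sh mpicc hello.c -o hello
$ srun --ntasks=8 --ntasks-per-node=4 --mpi=pmix singularity exec --nv my_container.sif /host/adapt.sh ./hello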
Example: Running `nccl-tests` in Singularity
We will now run `nccl-tests` to show how to achieve the full bandwidth of the interconnect. For this we will use the nccl-tests repository and the `nvcr.io/nvidia/pytorch:24.04-py3` NGC container, which gives us access to its compiler suite.
Pulling from Nvidia GPU Cloud (NGC)
Singularity has a known issue when pulling from NGC. You can export `GODEBUG=http2client=0` as a workaround. Please see the known issues section.
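If you hit this issue, set the variable in your shell before pulling:
$ export GODEBUG=http2client=0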
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ singularity pull docker://nvcr.io/nvidia/pytorch:24.04-py3
$ singularity exec --nv --bind $TMPDIR pytorch_24.04-py3.sif /host/adapt.sh bash
Singularity>
Inside the container, we now enter the nccl-tests directory and build it against the host OpenMPI:
Singularity> cd nccl-tests/
Singularity> make -j 72 MPI=1 MPI_HOME=/host/ompi
[build output...]
Singularity> exit
$ srun --cpus-per-task=72 --ntasks-per-node=1 --gpus=8 --mpi=pmix singularity exec --nv pytorch_24.04-py3.sif /host/adapt.sh nccl-tests/build/all_reduce_perf -b 32KB -e 8GB -f 2 -g 4
srun: job 23371 queued and waiting for resources
srun: job 23371 has been allocated resources
This container has been adapted to run MPI and NCCL applications on Isambard-AI.
The LD_LIBRARY_PATH and PATH have been set to include the OpenMPI and NCCL libraries.
# nThread 1 nGpus 4 minBytes 32768 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 178228 on nid001038 device 0 [0x01] GH200 120GB
# Rank 1 Group 0 Pid 178228 on nid001038 device 1 [0x01] GH200 120GB
# Rank 2 Group 0 Pid 178228 on nid001038 device 2 [0x01] GH200 120GB
# Rank 3 Group 0 Pid 178228 on nid001038 device 3 [0x01] GH200 120GB
# Rank 4 Group 0 Pid 172916 on nid001039 device 0 [0x01] GH200 120GB
# Rank 5 Group 0 Pid 172916 on nid001039 device 1 [0x01] GH200 120GB
# Rank 6 Group 0 Pid 172916 on nid001039 device 2 [0x01] GH200 120GB
# Rank 7 Group 0 Pid 172916 on nid001039 device 3 [0x01] GH200 120GB
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
32768 8192 float sum -1 47.39 0.69 1.21 0 48.15 0.68 1.19 0
65536 16384 float sum -1 59.05 1.11 1.94 0 50.88 1.29 2.25 0
131072 32768 float sum -1 75.48 1.74 3.04 0 66.59 1.97 3.44 0
262144 65536 float sum -1 226.5 1.16 2.03 0 329.0 0.80 1.39 0
524288 131072 float sum -1 1385.6 0.38 0.66 0 1387.2 0.38 0.66 0
1048576 262144 float sum -1 5846.0 0.18 0.31 0 5731.8 0.18 0.32 0
2097152 524288 float sum -1 989.7 2.12 3.71 0 989.3 2.12 3.71 0
4194304 1048576 float sum -1 995.7 4.21 7.37 0 988.5 4.24 7.43 0
8388608 2097152 float sum -1 1328.3 6.32 11.05 0 1327.3 6.32 11.06 0
16777216 4194304 float sum -1 1350.0 12.43 21.75 0 1345.6 12.47 21.82 0
33554432 8388608 float sum -1 1404.1 23.90 41.82 0 1400.3 23.96 41.93 0
67108864 16777216 float sum -1 1059.0 63.37 110.89 0 1057.4 63.47 111.07 0
134217728 33554432 float sum -1 1813.2 74.02 129.54 0 1815.5 73.93 129.38 0
268435456 67108864 float sum -1 3292.7 81.52 142.67 0 3291.6 81.55 142.71 0
536870912 134217728 float sum -1 6212.1 86.42 151.24 0 6214.2 86.39 151.19 0
1073741824 268435456 float sum -1 12045 89.14 156.00 0 12047 89.13 155.98 0
2147483648 536870912 float sum -1 23730 90.50 158.37 0 23710 90.57 158.50 0
4294967296 1073741824 float sum -1 47081 91.22 159.64 0 47097 91.19 159.59 0
8589934592 2147483648 float sum -1 93896 91.48 160.10 0 93891 91.49 160.10 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 66.5021
Under bus bandwidth (busbw), we reach 160.10 GB/s between 8 GPUs spread across 2 nodes.
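The same run can be submitted non-interactively with a batch script. The sketch below mirrors the `srun` line above and assumes it is submitted from the directory containing `nccl-tests/` and the image; the job name and time limit are placeholders you should adapt:
#!/bin/bash
#SBATCH --job-name=nccl-allreduce   # placeholder
#SBATCH --nodes=2
#SBATCH --gpus=8
#SBATCH --time=00:10:00             # placeholder

module load brics/singularity-multi-node

srun --cpus-per-task=72 --ntasks-per-node=1 --gpus=8 --mpi=pmix \
    singularity exec --nv pytorch_24.04-py3.sif /host/adapt.sh \
    nccl-tests/build/all_reduce_perf -b 32KB -e 8GB -f 2 -g 4
Submit the script with `sbatch`; the output above was produced by the equivalent interactive `srun` command.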
Using PMIX with torchrun
When using PMIX as the process management interface, which PyTorch does not detect automatically, `adapt.sh` will set `LOCAL_RANK=$PMIX_RANK`.
PyTorch's `torchrun` can also be run with the `--node_rank` option set explicitly, i.e. `torchrun --node_rank=$PMIX_RANK`.
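For example, from inside a two-node allocation, a `torchrun` launch could be wrapped as below. This is only a sketch: `train.py`, the node and per-node process counts, and the port are placeholders, and the escaped `\$PMIX_RANK` ensures the variable is expanded inside each task rather than by the submitting shell:
$ MASTER_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
$ srun --ntasks-per-node=1 --gpus=8 --mpi=pmix \
    singularity exec --nv pytorch_24.04-py3.sif /host/adapt.sh \
    bash -c "torchrun --nnodes=2 --nproc_per_node=4 --node_rank=\$PMIX_RANK --master_addr=${MASTER_NODE} --master_port=29500 train.py"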