Singularity Multi-node

To run Singularity containers on multiple nodes, load the following module:

$ module load brics/singularity-multi-node

This module populates the SINGULARITY_BINDPATH environment variable so that the required host dependencies are bound into the container. Inside your container, you must source /host/ for the changes to take effect. Here is an example using an Ubuntu container:

$ singularity pull docker://ubuntu:latest
$ singularity run --nv ubuntu_latest.sif
Singularity> source /host/

After sourcing, LD_LIBRARY_PATH and PATH include the OpenMPI and NCCL libraries.
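
To confirm the environment after sourcing, a small check along the following lines may help. This is only a sketch: the helper names list_binds and path_has are hypothetical, and /host/ompi/lib is assumed from the prefix that mpicc -show reports in the mpicc example on this page.

```shell
# list_binds: print each entry of a comma-separated bind list on its own line.
# SINGULARITY_BINDPATH uses the form src[:dest[:opts]],src[:dest[:opts]],...
list_binds() {
    printf '%s\n' "$1" | tr ',' '\n'
}

# path_has: succeed if directory $2 is an element of the colon-separated list $1.
path_has() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use (inside the container, after sourcing the host setup):
#   list_binds "$SINGULARITY_BINDPATH"
#   path_has "$LD_LIBRARY_PATH" /host/ompi/lib && echo "host OpenMPI libraries found"
```

Wrapping the colon-separated list in extra colons lets path_has match whole entries only, so /host/ompi/lib does not match a longer path that merely starts with it.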


Existing runscripts

If your container already has a runscript or entrypoint, you can execute it as follows:

Singularity> /.singularity.d/runscript
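
If you script this step, a guard like the one below avoids a confusing error when the runscript is missing or not executable. It is a sketch only; run_if_executable is a hypothetical helper name, not something the module provides.

```shell
# run_if_executable: run the script at $1 with the remaining arguments if it
# exists and is executable; otherwise report the problem and return non-zero.
run_if_executable() {
    script=$1
    shift
    if [ -x "$script" ]; then
        "$script" "$@"
    else
        echo "no executable runscript at $script" >&2
        return 1
    fi
}

# Example use inside the container:
#   run_if_executable /.singularity.d/runscript
```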

Using PMIX

The OpenMPI currently installed on the system is built with PMIX support only, so jobs that use containers must be launched with srun --mpi=pmix. Please see the MPI documentation.
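
Because the flag is easy to forget, one option is a thin wrapper that always adds it. The sketch below assumes a hypothetical helper named pmix_srun and a hypothetical DRY_RUN variable for previewing the command line; neither is part of the module.

```shell
# pmix_srun: run a command through srun with the PMIx plugin always selected.
# Set DRY_RUN=1 to print the command line instead of executing it.
pmix_srun() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "srun --mpi=pmix $*"
    else
        srun --mpi=pmix "$@"
    fi
}

# Example use (angle-bracket items are placeholders):
#   pmix_srun <srun options> singularity exec <image> <command>
```

Running srun --mpi=list should show which MPI plugin types the local Slurm installation offers, which is a quick way to confirm that pmix is available.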

Example: mpicc and ompi_info

The MPI C compiler wrapper (mpicc) is configured inside a multi-node container. Next, we check which mpicc is in use and verify its installation prefix using the singularity exec command:

$ singularity exec --nv ubuntu_latest.sif /host/

Singularity> which mpicc
Singularity> mpicc -show
/usr/bin/gcc -I/host/ompi/include -pthread -L/host/ompi/lib -Wl,-rpath -Wl,/host/ompi/lib -Wl,--enable-new-dtags -lmpi
Singularity> ompi_info

This verifies that mpicc includes the header files and links against the libraries under /host/ompi. ompi_info shows more about how OpenMPI was built. Note that you must supply your own C compiler (e.g. gcc) or use a container with a built-in compiler suite.
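
To check the wrapper in a script rather than by eye, the include and library directories can be pulled out of the mpicc -show line. The following is a minimal sketch; wrapper_paths is a hypothetical helper name, and the sample line is the output quoted above.

```shell
# wrapper_paths: print the -I and -L directories from an "mpicc -show" line,
# one per line, prefixed with "include" or "lib".
wrapper_paths() {
    # $1 is left unquoted on purpose so the shell splits it into words.
    for word in $1; do
        case "$word" in
            -I*) echo "include ${word#-I}" ;;
            -L*) echo "lib ${word#-L}" ;;
        esac
    done
}

# The mpicc -show output quoted on this page.
show_line='/usr/bin/gcc -I/host/ompi/include -pthread -L/host/ompi/lib -Wl,-rpath -Wl,/host/ompi/lib -Wl,--enable-new-dtags -lmpi'
wrapper_paths "$show_line"
# Inside the container you could instead run: wrapper_paths "$(mpicc -show)"
```

For the quoted line this prints "include /host/ompi/include" and "lib /host/ompi/lib", so both directories sit under the /host/ompi prefix.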

Example: Running nccl-tests in singularity

We will now run nccl-tests to show how to get the full bandwidth of the interconnect. For this we use the nccl-tests repository and the PyTorch NGC container (pytorch_24.04-py3.sif in the commands below), which provides a compiler suite.

Pulling from NVIDIA GPU Cloud (NGC)

Singularity has a known issue when pulling from NGC. As a workaround, you can export GODEBUG=http2client=0. Please see the known issues section.
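
If you prefer not to export the variable for the whole session, it can be set for a single command instead. The sketch below uses a hypothetical helper named no_http2; the image URI in the usage comment is a placeholder.

```shell
# no_http2: run one command with Go's HTTP/2 client disabled, leaving the
# rest of the shell session untouched.
no_http2() {
    GODEBUG=http2client=0 "$@"
}

# Example use (the image URI is a placeholder, not a real path):
#   no_http2 singularity pull docker://<registry>/<image>:<tag>
```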

$ git clone
$ singularity pull docker://
$ singularity exec --nv --bind $TMPDIR pytorch_24.04-py3.sif /host/ bash

Inside the container we will now enter the nccl-tests directory and build using the host OpenMPI:

Singularity> cd nccl-tests/
Singularity> make -j 72 MPI=1 MPI_HOME=/host/ompi
[build output...]
Singularity> exit
$ srun --cpus-per-task=72 --ntasks-per-node=1 --gpus=8 --mpi=pmix singularity exec --nv pytorch_24.04-py3.sif /host/ nccl-tests/build/all_reduce_perf -b 32KB -e 8GB -f 2 -g 4

srun: job 23371 queued and waiting for resources
srun: job 23371 has been allocated resources

# nThread 1 nGpus 4 minBytes 32768 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
# Using devices
#  Rank  0 Group  0 Pid 178228 on  nid001038 device  0 [0x01] GH200 120GB
#  Rank  1 Group  0 Pid 178228 on  nid001038 device  1 [0x01] GH200 120GB
#  Rank  2 Group  0 Pid 178228 on  nid001038 device  2 [0x01] GH200 120GB
#  Rank  3 Group  0 Pid 178228 on  nid001038 device  3 [0x01] GH200 120GB
#  Rank  4 Group  0 Pid 172916 on  nid001039 device  0 [0x01] GH200 120GB
#  Rank  5 Group  0 Pid 172916 on  nid001039 device  1 [0x01] GH200 120GB
#  Rank  6 Group  0 Pid 172916 on  nid001039 device  2 [0x01] GH200 120GB
#  Rank  7 Group  0 Pid 172916 on  nid001039 device  3 [0x01] GH200 120GB
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
       32768          8192     float     sum      -1    47.39    0.69    1.21      0    48.15    0.68    1.19      0
       65536         16384     float     sum      -1    59.05    1.11    1.94      0    50.88    1.29    2.25      0
      131072         32768     float     sum      -1    75.48    1.74    3.04      0    66.59    1.97    3.44      0
      262144         65536     float     sum      -1    226.5    1.16    2.03      0    329.0    0.80    1.39      0
      524288        131072     float     sum      -1   1385.6    0.38    0.66      0   1387.2    0.38    0.66      0
     1048576        262144     float     sum      -1   5846.0    0.18    0.31      0   5731.8    0.18    0.32      0
     2097152        524288     float     sum      -1    989.7    2.12    3.71      0    989.3    2.12    3.71      0
     4194304       1048576     float     sum      -1    995.7    4.21    7.37      0    988.5    4.24    7.43      0
     8388608       2097152     float     sum      -1   1328.3    6.32   11.05      0   1327.3    6.32   11.06      0
    16777216       4194304     float     sum      -1   1350.0   12.43   21.75      0   1345.6   12.47   21.82      0
    33554432       8388608     float     sum      -1   1404.1   23.90   41.82      0   1400.3   23.96   41.93      0
    67108864      16777216     float     sum      -1   1059.0   63.37  110.89      0   1057.4   63.47  111.07      0
   134217728      33554432     float     sum      -1   1813.2   74.02  129.54      0   1815.5   73.93  129.38      0
   268435456      67108864     float     sum      -1   3292.7   81.52  142.67      0   3291.6   81.55  142.71      0
   536870912     134217728     float     sum      -1   6212.1   86.42  151.24      0   6214.2   86.39  151.19      0
  1073741824     268435456     float     sum      -1    12045   89.14  156.00      0    12047   89.13  155.98      0
  2147483648     536870912     float     sum      -1    23730   90.50  158.37      0    23710   90.57  158.50      0
  4294967296    1073741824     float     sum      -1    47081   91.22  159.64      0    47097   91.19  159.59      0
  8589934592    2147483648     float     sum      -1    93896   91.48  160.10      0    93891   91.49  160.10      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 66.5021 

The bus bandwidth (busbw) column shows that we reach 160.10 GB/s across 8 GPUs spread over 2 nodes.
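
For all-reduce, nccl-tests derives busbw from the algorithm bandwidth (algbw) by multiplying by 2(n-1)/n, where n is the number of ranks. The sketch below reproduces that arithmetic for the last row of the table; busbw_allreduce is a hypothetical helper name.

```shell
# busbw_allreduce: bus bandwidth for all-reduce from algorithm bandwidth
# $1 (GB/s) and rank count $2, using busbw = algbw * 2 * (n - 1) / n.
busbw_allreduce() {
    awk -v algbw="$1" -v n="$2" 'BEGIN { printf "%.2f\n", algbw * 2 * (n - 1) / n }'
}

# Last row of the table above: algbw 91.48 GB/s across 8 ranks.
busbw_allreduce 91.48 8
```

This prints 160.09, which agrees with the reported 160.10 once you allow for algbw being rounded to two decimal places in the table.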

Using PMIX with torchrun

PyTorch does not detect PMIX as the process management interface, so LOCAL_RANK is set to $PMIX_RANK. PyTorch's torchrun can also be run with the --node_rank option set explicitly, i.e. torchrun --node_rank=$PMIX_RANK.
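
If you need the same mapping in your own launch script, it could be written by hand roughly as follows. This is a sketch under the assumption that PMIX_RANK is present in the environment of each task started with srun --mpi=pmix; export_local_rank is a hypothetical helper name.

```shell
# export_local_rank: export LOCAL_RANK from PMIX_RANK unless LOCAL_RANK is
# already set. Fails if neither variable is available.
export_local_rank() {
    if [ -n "${LOCAL_RANK:-}" ]; then
        export LOCAL_RANK
        return 0
    fi
    if [ -z "${PMIX_RANK:-}" ]; then
        echo "PMIX_RANK is not set; was this started with srun --mpi=pmix?" >&2
        return 1
    fi
    LOCAL_RANK=$PMIX_RANK
    export LOCAL_RANK
}

# Example use at the top of a per-task launch script:
#   export_local_rank || exit 1
```

Leaving an existing LOCAL_RANK untouched keeps the helper safe to call when the value has already been set elsewhere.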
