Podman-HPC Multi-node

podman-hpc has been set up with two extra flags:

  1. --mpi-trial This adds OpenMPI and PMIx to the container.
  2. --nccl-trial This adds NCCL, OpenMPI, and PMIx to the container.

These flags mount the required host dependencies into the container, and an entrypoint script sets up the environment inside it.
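
For example, a minimal invocation might look like the following sketch. The container image is a placeholder, and the ompi_info and /host/nccl checks are shown only to illustrate what the flags provide (assuming, as the entrypoint banner suggests, that the host OpenMPI ends up on PATH and NCCL is mounted at /host/nccl):

$ podman-hpc run --rm --mpi-trial <CONTAINER_IMAGE> ompi_info --version
$ podman-hpc run --rm --nccl-trial --gpu <CONTAINER_IMAGE> ls /host/nccl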

Below is an example showing how to use these flags in conjunction with nccl-tests (for NCCL / AI workloads).

Entrypoints

The podman-hpc flags modify the entrypoint of the container to execute /host/adapt.sh. If your image defines its own entrypoint, you will have to execute it manually after starting the container. If you would prefer not to run the adapt.sh entrypoint at all, set the entrypoint to empty: podman-hpc run --entrypoint= <CONTAINER_IMAGE>.
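
For instance, an image's own entrypoint script can be passed as the command so that it runs after adapt.sh has set the environment. This is a sketch; /my-entrypoint.sh is a hypothetical path inside your image:

$ podman-hpc run -it --rm --mpi-trial <CONTAINER_IMAGE> /my-entrypoint.sh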

Using PMIx

The OpenMPI build currently on the system supports PMIx only, so when you launch containerised jobs you must use srun --mpi=pmix. Please see the MPI documentation.
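
A containerised MPI launch therefore follows this general pattern (a sketch: the node and task counts, container image, and ./my_mpi_app are placeholders):

$ srun --mpi=pmix -N 2 --ntasks-per-node=4 \
    podman-hpc run --mpi-trial -v $PWD:$PWD -w $PWD <CONTAINER_IMAGE> ./my_mpi_app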

Example: Running nccl-tests in podman-hpc

This example shows how to build nccl-tests inside an NGC container, using the host's MPI, without rebuilding the container itself. We use the nccl-tests repository together with the nvcr.io/nvidia/pytorch:24.04-py3 NGC container, which provides the compiler suite. First nccl-tests is cloned, and the current working directory is mounted into the container so that the build persists on the host.

$ podman-hpc pull nvcr.io/nvidia/pytorch:24.04-py3
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ podman-hpc run -it --rm --nccl-trial -v $PWD:$PWD -w $PWD nvcr.io/nvidia/pytorch:24.04-py3 bash

This container has been adapted to run MPI and NCCL applications on Isambard-AI.
The LD_LIBRARY_PATH and PATH have been set to include the OpenMPI and NCCL libraries.

Please read the relevant documentation for using containers on multi-node jobs:
https://docs.isambard.ac.uk/user-documentation/guides/containers-multi-node/

user.project@nid001040:~$ make -j 32 MPI=1 MPI_HOME=/host/ompi NCCL_HOME=/host/nccl CUDA_HOME=${CUDA_HOME}
[build output...]
user.project@nid001040:~$ exit

When running the container with podman-hpc run, note the use of the following flags:

  1. -it Run interactively.
  2. --rm Remove the container after exiting.
  3. --nccl-trial Add NCCL, OpenMPI and PMIx to the container.
  4. -v $PWD:$PWD Mount the current working directory in the same path.
  5. -w $PWD Set the same working directory inside the container.
  6. bash Must be passed as the command so that the entrypoint script hands over to an interactive shell.
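
If you prefer not to build interactively, the same make invocation can also be run in a single podman-hpc command. This is a sketch reusing the command above; ${CUDA_HOME} is expanded inside the container, where the NGC image defines it:

$ podman-hpc run --rm --nccl-trial -v $PWD:$PWD -w $PWD nvcr.io/nvidia/pytorch:24.04-py3 \
    bash -c 'make -j 32 MPI=1 MPI_HOME=/host/ompi NCCL_HOME=/host/nccl CUDA_HOME=${CUDA_HOME}'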

After exiting the container we can use srun to dispatch the job. To execute the benchmark we must add --mpi=pmix so that Slurm interfaces using PMIx, and pass the --nccl-trial flag to podman-hpc:

$ cd nccl-tests
$ export ALL_REDUCE_BIN=$PWD/build/all_reduce_perf
$ srun -N 2 --gpus=8 --mpi=pmix --cpus-per-task 72 --ntasks-per-node 1 \
podman-hpc run --nccl-trial --gpu -v $PWD:$PWD -w $PWD nvcr.io/nvidia/pytorch:24.04-py3 \
$ALL_REDUCE_BIN -b 32KB -e 8GB -f 2 -g 4
srun: job 23916 queued and waiting for resources
srun: job 23916 has been allocated resources

This container has been adapted to run MPI and NCCL applications on Isambard-AI.
The LD_LIBRARY_PATH and PATH have been set to include the OpenMPI and NCCL libraries.

Please read the relevant documentation for using containers on multi-node jobs:
https://docs.isambard.ac.uk/user-documentation/guides/containers-multi-node/

# nThread 1 nGpus 4 minBytes 32768 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 242178 on  nid001007 device  0 [0x01] GH200 120GB
#  Rank  1 Group  0 Pid 242178 on  nid001007 device  1 [0x01] GH200 120GB
#  Rank  2 Group  0 Pid 242178 on  nid001007 device  2 [0x01] GH200 120GB
#  Rank  3 Group  0 Pid 242178 on  nid001007 device  3 [0x01] GH200 120GB
#  Rank  4 Group  0 Pid  77688 on  nid001008 device  0 [0x01] GH200 120GB
#  Rank  5 Group  0 Pid  77688 on  nid001008 device  1 [0x01] GH200 120GB
#  Rank  6 Group  0 Pid  77688 on  nid001008 device  2 [0x01] GH200 120GB
#  Rank  7 Group  0 Pid  77688 on  nid001008 device  3 [0x01] GH200 120GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
       32768          8192     float     sum      -1    47.55    0.69    1.21      0    42.83    0.77    1.34      0
       65536         16384     float     sum      -1    58.63    1.12    1.96      0    57.88    1.13    1.98      0
      131072         32768     float     sum      -1    80.02    1.64    2.87      0    57.96    2.26    3.96      0
      262144         65536     float     sum      -1    326.2    0.80    1.41      0    322.6    0.81    1.42      0
      524288        131072     float     sum      -1   1666.9    0.31    0.55      0   1685.7    0.31    0.54      0
     1048576        262144     float     sum      -1   6373.3    0.16    0.29      0   6020.4    0.17    0.30      0
     2097152        524288     float     sum      -1   1004.1    2.09    3.65      0   1009.2    2.08    3.64      0
     4194304       1048576     float     sum      -1   1013.7    4.14    7.24      0   1015.6    4.13    7.23      0
     8388608       2097152     float     sum      -1   1347.2    6.23   10.90      0   1351.1    6.21   10.87      0
    16777216       4194304     float     sum      -1   1368.8   12.26   21.45      0   1364.8   12.29   21.51      0
    33554432       8388608     float     sum      -1   1420.4   23.62   41.34      0   1409.5   23.81   41.66      0
    67108864      16777216     float     sum      -1   1057.0   63.49  111.11      0   1051.7   63.81  111.67      0
   134217728      33554432     float     sum      -1   1820.7   73.72  129.00      0   1812.2   74.06  129.61      0
   268435456      67108864     float     sum      -1   3298.4   81.38  142.42      0   3293.8   81.50  142.62      0
   536870912     134217728     float     sum      -1   6214.5   86.39  151.18      0   6212.5   86.42  151.23      0
  1073741824     268435456     float     sum      -1    12055   89.07  155.87      0    12051   89.10  155.93      0
  2147483648     536870912     float     sum      -1    23709   90.58  158.51      0    23727   90.51  158.39      0
  4294967296    1073741824     float     sum      -1    47087   91.21  159.62      0    47080   91.23  159.65      0
  8589934592    2147483648     float     sum      -1    93892   91.49  160.10      0    93863   91.52  160.15      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 66.4308 
#

Under bus bandwidth (busbw) we reach 160.10 GB/s across the 8 GPUs spread over the 2 nodes.
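
For non-interactive runs, the same srun line can be wrapped in a Slurm batch script and submitted with sbatch. Below is a minimal sketch; the job name, wall time, and the path to your nccl-tests clone are placeholders to replace with your own values:

#!/bin/bash
#SBATCH --job-name=nccl-allreduce
#SBATCH --nodes=2
#SBATCH --gpus=8
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=72
#SBATCH --time=00:20:00

# Placeholder: path to your nccl-tests clone
cd /path/to/nccl-tests
export ALL_REDUCE_BIN=$PWD/build/all_reduce_perf

srun --mpi=pmix \
    podman-hpc run --nccl-trial --gpu -v $PWD:$PWD -w $PWD nvcr.io/nvidia/pytorch:24.04-py3 \
    $ALL_REDUCE_BIN -b 32KB -e 8GB -f 2 -g 4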