Podman-HPC Multi-node

Prerequisites

Please make sure you are familiar with the Podman-HPC, MPI, and PMI guides.

On Isambard-AI, podman-hpc has been set up with these additional flags:

  1. --mpich: This adds NCCL, MPICH, and Cray PMI.
  2. --openmpi-pmix: This adds NCCL, OpenMPI, and PMIx.

These flags mount the required host dependencies into the container, and an entrypoint script (/host/adapt.sh) sets up the environment inside it. Below are examples showing how to use these flags with osu-micro-benchmarks (for MPI applications) and nccl-tests (for NCCL / AI workloads).
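To see exactly which site flags are available on the current system you can inspect the podman-hpc help output. A quick check (output abbreviated; the exact flag list depends on the site configuration):

$ podman-hpc run --help
[... standard podman options, plus site-added flags such as --gpu, --mpich and --openmpi-pmix ...]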

Entrypoints

The podman-hpc flags modify the entrypoint of the container to execute /host/adapt.sh. This means that if your image defines its own entrypoint you will have to execute it manually after the container starts. If you do not want the adapt.sh entrypoint to run at all, set the entrypoint to empty: podman-hpc run --entrypoint= <CONTAINER_IMAGE>.
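For example, to get a plain shell in one of the containers used below with adapt.sh disabled (a sketch; substitute your own image and command):

$ podman-hpc run -it --rm --mpich --entrypoint= nvcr.io/nvidia/pytorch:24.04-py3 bash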

Using PMIx

The OpenMPI currently installed on the system is built with PMIx support only, so when you launch container jobs that use OpenMPI you must use srun --mpi=pmix. Please see the MPI documentation.
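In practice, every containerised OpenMPI launch follows the same shape (a minimal sketch; the nccl-tests example further down this page is a complete version of it):

$ srun --mpi=pmix -N <NODES> --ntasks-per-node 1 \
    podman-hpc run --openmpi-pmix --gpu <CONTAINER_IMAGE> <MPI_PROGRAM>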

Example: Running osu-micro-benchmarks with Podman-HPC

To showcase running the OSU micro-benchmarks in Podman-HPC containers we will use the --mpich flag. We will use the nvcr.io/nvidia/pytorch:24.04-py3 NGC container to have access to its compilers and build environment.

$ pwd
/projects/brics/public/test/podman-hpc/mpich
$ podman-hpc pull nvcr.io/nvidia/pytorch:24.04-py3
$ wget https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.4.tar.gz
$ tar -xf osu-micro-benchmarks-7.4.tar.gz
$ cd osu-micro-benchmarks-7.4
$ podman-hpc run -it --gpu --mpich -v /projects:/projects -v $PWD:$PWD:rw nvcr.io/nvidia/pytorch:24.04-py3 bash
root@nid001041 $ ./configure CC=mpicc CXX=mpicxx && make && make install 
[build output ...]

We can then use srun to execute the micro-benchmarks with the correct flags:

$ srun -N 2 --gpus=2 --cpus-per-task 72 --ntasks-per-node 1 \
    podman-hpc run -it --rm --gpu --mpich nvcr.io/nvidia/pytorch:24.04-py3 \
    osu-micro-benchmarks-7.4/c/mpi/pt2pt/standard/osu_latency

Isambard-AI. Container adapted and running to run MPICH and NCCL on nid001004. https://docs.isambard.ac.uk/user-documentation/guides/containers/

Isambard-AI. Container adapted and running to run MPICH and NCCL on nid001005. https://docs.isambard.ac.uk/user-documentation/guides/containers/

# OSU MPI Latency Test v7.4
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                       3.07
2                       2.87
4                       3.03
8                       2.88
16                      3.03
32                      2.87
64                      2.92
128                     3.07
256                     3.60
512                     3.79
1024                    3.88
2048                    3.88
4096                    4.31
8192                    4.56
16384                   4.99
32768                   8.44
65536                   9.78
131072                 14.05
262144                 17.87
524288                 30.23
1048576                50.29
2097152                95.15
4194304               186.46
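The same container invocation can run any of the other benchmarks from this build. For example, the point-to-point bandwidth test sits alongside osu_latency in the source tree (path assumed from the in-tree OSU 7.4 build above):

$ srun -N 2 --gpus=2 --cpus-per-task 72 --ntasks-per-node 1 \
    podman-hpc run -it --rm --gpu --mpich nvcr.io/nvidia/pytorch:24.04-py3 \
    osu-micro-benchmarks-7.4/c/mpi/pt2pt/standard/osu_bw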

Example: Running nccl-tests with Podman-HPC

Using NGC containers?

If you are running NGC (Nvidia GPU Cloud) containers it is recommended that you use --openmpi-pmix to adapt the MPI in the container for the system.

Shown here is an example of how to build nccl-tests in an NGC container, using the host's MPI, without rebuilding the container itself. We use the nccl-tests repository and the nvcr.io/nvidia/pytorch:24.04-py3 NGC container for access to its compiler suite. First nccl-tests is cloned, then the current working directory is mounted into the container so that the build persists.

$ podman-hpc pull nvcr.io/nvidia/pytorch:24.04-py3
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ podman-hpc run -it --rm --openmpi-pmix -v $PWD:$PWD -w $PWD nvcr.io/nvidia/pytorch:24.04-py3 bash

Isambard-AI. Container adapted and running to run OpenMPI and NCCL on nid001040. https://docs.isambard.ac.uk/user-documentation/guides/containers/

user.project@nid001040:~$ make -j 32 MPI=1 MPI_HOME=/host/ompi NCCL_HOME=/host/nccl CUDA_HOME=${CUDA_HOME}
[build output...]
user.project@nid001040:~$ exit

When running the container with podman-hpc run, note the use of the following flags:

  1. -it Run interactively.
  2. --rm Remove the container after exiting.
  3. --openmpi-pmix Add NCCL, OpenMPI and PMIx to the container.
  4. -v $PWD:$PWD Mount the current working directory in the same path.
  5. -w $PWD Set the same working directory inside the container.
  6. bash Required so that the container drops into an interactive shell after the entrypoint script has run.
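The build can also be run non-interactively, which is useful from batch scripts. A sketch, assuming the adapt.sh entrypoint passes the command through and that CUDA_HOME is set inside the NGC image:

$ podman-hpc run --rm --openmpi-pmix -v $PWD:$PWD -w $PWD \
    nvcr.io/nvidia/pytorch:24.04-py3 \
    bash -c 'make -j 32 MPI=1 MPI_HOME=/host/ompi NCCL_HOME=/host/nccl CUDA_HOME=${CUDA_HOME}'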

After exiting the container we can use srun to dispatch our job. To execute the benchmarks we must add --mpi=pmix so that Slurm interfaces using PMIx, and pass the --openmpi-pmix flag to podman-hpc:

$ cd nccl-tests
$ export ALL_REDUCE_BIN=$PWD/build/all_reduce_perf
$ srun -N 2 --gpus=8 --mpi=pmix --cpus-per-task 72 --ntasks-per-node 1 \
    podman-hpc run --openmpi-pmix --gpu -v $PWD:$PWD -w $PWD nvcr.io/nvidia/pytorch:24.04-py3 \
    $ALL_REDUCE_BIN -b 32KB -e 8GB -f 2 -g 4
srun: job 23916 queued and waiting for resources
srun: job 23916 has been allocated resources

Isambard-AI. Container adapted and running to run OpenMPI and NCCL on nid001031. https://docs.isambard.ac.uk/user-documentation/guides/containers/

Isambard-AI. Container adapted and running to run OpenMPI and NCCL on nid001032. https://docs.isambard.ac.uk/user-documentation/guides/containers/

# nThread 1 nGpus 4 minBytes 32768 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  30441 on  nid001031 device  0 [0009:01:00] NVIDIA GH200 120GB
#  Rank  1 Group  0 Pid  30441 on  nid001031 device  1 [0019:01:00] NVIDIA GH200 120GB
#  Rank  2 Group  0 Pid  30441 on  nid001031 device  2 [0029:01:00] NVIDIA GH200 120GB
#  Rank  3 Group  0 Pid  30441 on  nid001031 device  3 [0039:01:00] NVIDIA GH200 120GB
#  Rank  4 Group  0 Pid 153863 on  nid001032 device  0 [0009:01:00] NVIDIA GH200 120GB
#  Rank  5 Group  0 Pid 153863 on  nid001032 device  1 [0019:01:00] NVIDIA GH200 120GB
#  Rank  6 Group  0 Pid 153863 on  nid001032 device  2 [0029:01:00] NVIDIA GH200 120GB
#  Rank  7 Group  0 Pid 153863 on  nid001032 device  3 [0039:01:00] NVIDIA GH200 120GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
       32768          8192     float     sum      -1    43.93    0.75    1.31      0    39.52    0.83    1.45      0
       65536         16384     float     sum      -1    141.5    0.46    0.81      0    141.1    0.46    0.81      0
      131072         32768     float     sum      -1    153.2    0.86    1.50      0    162.4    0.81    1.41      0
      262144         65536     float     sum      -1    313.4    0.84    1.46      0    322.1    0.81    1.42      0
      524288        131072     float     sum      -1    628.2    0.83    1.46      0   1098.0    0.48    0.84      0
     1048576        262144     float     sum      -1    893.7    1.17    2.05      0    910.1    1.15    2.02      0
     2097152        524288     float     sum      -1    136.8   15.33   26.82      0    135.0   15.53   27.18      0
     4194304       1048576     float     sum      -1    186.3   22.52   39.40      0    185.9   22.57   39.49      0
     8388608       2097152     float     sum      -1    290.0   28.92   50.62      0    283.8   29.56   51.72      0
    16777216       4194304     float     sum      -1    425.4   39.44   69.02      0    422.9   39.67   69.42      0
    33554432       8388608     float     sum      -1    634.7   52.87   92.52      0    632.5   53.05   92.85      0
    67108864      16777216     float     sum      -1   1080.9   62.09  108.66      0   1081.1   62.08  108.63      0
   134217728      33554432     float     sum      -1   1853.2   72.42  126.74      0   1838.9   72.99  127.73      0
   268435456      67108864     float     sum      -1   3321.7   80.81  141.42      0   3321.5   80.82  141.43      0
   536870912     134217728     float     sum      -1   6248.9   85.91  150.35      0   6249.7   85.90  150.33      0
  1073741824     268435456     float     sum      -1    12138   88.46  154.81      0    12144   88.42  154.73      0
  2147483648     536870912     float     sum      -1    23883   89.92  157.36      0    23886   89.90  157.33      0
  4294967296    1073741824     float     sum      -1    47399   90.61  158.57      0    47416   90.58  158.52      0
  8589934592    2147483648     float     sum      -1    94425   90.97  159.20      0    94407   90.99  159.23      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 76.069 
#

Under bus bandwidth (busbw) we reach 159.20 GB/s between 8 GPUs spread over 2 nodes.
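The interactive srun above can also be wrapped in a batch script for unattended runs. A minimal sketch mirroring the flags used in this example (you may also need to add your project's account and partition directives):

#!/bin/bash
#SBATCH --job-name=nccl-allreduce
#SBATCH --nodes=2
#SBATCH --gpus=8
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=72

cd nccl-tests
export ALL_REDUCE_BIN=$PWD/build/all_reduce_perf
srun --mpi=pmix \
    podman-hpc run --openmpi-pmix --gpu -v $PWD:$PWD -w $PWD nvcr.io/nvidia/pytorch:24.04-py3 \
    $ALL_REDUCE_BIN -b 32KB -e 8GB -f 2 -g 4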