Podman-HPC Multi-node¶
Prerequisites
Please make sure you are familiar with the Podman-HPC, MPI, and PMI guides.
On Isambard-AI `podman-hpc` has been set up with these flags:

- `--mpich`: adds NCCL, MPICH, and Cray PMI.
- `--openmpi-pmix`: adds NCCL, OpenMPI, and PMIx.

These flags mount the required host dependencies into the container, and an entrypoint script (`/host/adapt.sh`) sets up the environment. Below are examples that show how to use these flags in conjunction with `osu-micro-benchmarks` (for MPI applications) and `nccl-tests` (for NCCL / AI workloads).
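For example, a quick way to confirm that a flag has adapted the container is to list the mounted host directory. This is a minimal sketch using the NGC image from the examples below; the exact contents of /host are not guaranteed and may change between system releases:

$ podman-hpc run --rm --gpu --mpich nvcr.io/nvidia/pytorch:24.04-py3 ls /host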
Entrypoints
The `podman-hpc` flags modify the entrypoint of the container to execute `/host/adapt.sh`. If your container relies on its own entrypoint, you will have to execute it manually after starting the container. If you would prefer not to run the `adapt.sh` entrypoint at all, you can set it to empty: `podman-hpc run --entrypoint= <CONTAINER_IMAGE>`.
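For instance, with the NGC image used later in this guide, the two invocations below show the difference (a sketch of the syntax only):

# Default: /host/adapt.sh runs first, then the command you pass (here, bash)
$ podman-hpc run -it --mpich nvcr.io/nvidia/pytorch:24.04-py3 bash

# Skip /host/adapt.sh entirely by clearing the entrypoint
$ podman-hpc run -it --mpich --entrypoint= nvcr.io/nvidia/pytorch:24.04-py3 bash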
Using PMIx
The OpenMPI currently on the system is built only with PMIx support. When you execute jobs with containers you must use `srun --mpi=pmix`. Please see the MPI documentation.
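In practice, a multi-node launch of a container adapted with `--openmpi-pmix` therefore follows this shape. This is a skeleton only; `<CONTAINER_IMAGE>` and `<MPI_APP>` are placeholders, and a full worked example is shown below:

$ srun --mpi=pmix -N 2 --ntasks-per-node 1 \
    podman-hpc run --rm --gpu --openmpi-pmix <CONTAINER_IMAGE> <MPI_APP>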
Example: Running osu-micro-benchmarks with Podman-HPC¶
To showcase running the OSU micro-benchmarks in Podman-HPC containers, we will use the `--mpich` flag. We will use the `nvcr.io/nvidia/pytorch:24.04-py3` NGC container to have access to its compilers and build environment.
$ pwd
/projects/brics/public/test/podman-hpc/mpich
$ podman-hpc pull nvcr.io/nvidia/pytorch:24.04-py3
$ wget https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.4.tar.gz
$ tar -xf osu-micro-benchmarks-7.4.tar.gz
$ cd osu-micro-benchmarks-7.4
$ podman-hpc run -it --gpu --mpich -v /projects:/projects -v $PWD:$PWD:rw nvcr.io/nvidia/pytorch:24.04-py3 bash
root@nid001041 $ ./configure CC=mpicc CXX=mpicxx && make && make install
[build output ...]
We can then use `srun` to execute the micro-benchmarks with the correct flags:
$ srun -N 2 --gpus=2 --cpus-per-task 72 --ntasks-per-node 1 \
podman-hpc run -it --rm --gpu --mpich nvcr.io/nvidia/pytorch:24.04-py3 \
osu-micro-benchmarks-7.4/c/mpi/pt2pt/standard/osu_latency
Isambard-AI. Container adapted and running to run MPICH and NCCL on nid001004. https://docs.isambard.ac.uk/user-documentation/guides/containers/
Isambard-AI. Container adapted and running to run MPICH and NCCL on nid001005. https://docs.isambard.ac.uk/user-documentation/guides/containers/
# OSU MPI Latency Test v7.4
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 3.07
2 2.87
4 3.03
8 2.88
16 3.03
32 2.87
64 2.92
128 3.07
256 3.60
512 3.79
1024 3.88
2048 3.88
4096 4.31
8192 4.56
16384 4.99
32768 8.44
65536 9.78
131072 14.05
262144 17.87
524288 30.23
1048576 50.29
2097152 95.15
4194304 186.46
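The same run can also be submitted as a batch job. Below is a minimal sbatch sketch, assuming it is submitted from the directory containing osu-micro-benchmarks-7.4; the job name and time limit are illustrative placeholders:

#!/bin/bash
#SBATCH --job-name=osu-latency      # illustrative name
#SBATCH --nodes=2
#SBATCH --gpus=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=72
#SBATCH --time=00:10:00             # illustrative time limit

# Same invocation as the interactive srun above; -it is dropped as a batch job has no TTY
srun podman-hpc run --rm --gpu --mpich nvcr.io/nvidia/pytorch:24.04-py3 \
    osu-micro-benchmarks-7.4/c/mpi/pt2pt/standard/osu_latency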
Example: Running nccl-tests with Podman-HPC¶
Using NGC containers?
If you are running NGC (Nvidia GPU Cloud) containers, it is recommended that you use `--openmpi-pmix` to adapt the MPI in the container for the system.
Shown here is an example of how to build `nccl-tests` against an NGC container without rebuilding the container itself, using the host's MPI. We use the nccl-tests repository and the `nvcr.io/nvidia/pytorch:24.04-py3` NGC container to have access to its compiler suite. First `nccl-tests` is cloned, and the current working directory is mounted into the container so that the build persists.
$ podman-hpc pull nvcr.io/nvidia/pytorch:24.04-py3
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ podman-hpc run -it --rm --openmpi-pmix -v $PWD:$PWD -w $PWD nvcr.io/nvidia/pytorch:24.04-py3 bash
Isambard-AI. Container adapted and running to run OpenMPI and NCCL on nid001040. https://docs.isambard.ac.uk/user-documentation/guides/containers/
user.project@nid001040:~$ make -j 32 MPI=1 MPI_HOME=/host/ompi NCCL_HOME=/host/nccl CUDA_HOME=${CUDA_HOME}
[build output...]
user.project@nid001040:~$ exit
In the execution of the container with `podman-hpc run`, note the use of the following flags:

- `-it`: run interactively.
- `--rm`: remove the container after exiting.
- `--openmpi-pmix`: add NCCL, OpenMPI, and PMIx to the container.
- `-v $PWD:$PWD`: mount the current working directory at the same path.
- `-w $PWD`: set the same working directory inside the container.
- `bash`: this has to be added for the entrypoint script to execute interactively.
After exiting the container we can now use `srun` to dispatch our job. To execute the benchmarks we must add `--mpi=pmix` for Slurm to interface using PMIx, and the `--openmpi-pmix` flag to `podman-hpc`:
$ cd nccl-tests
$ export ALL_REDUCE_BIN=$PWD/build/all_reduce_perf
$ srun -N 2 --gpus=8 --mpi=pmix --cpus-per-task 72 --ntasks-per-node 1 \
podman-hpc run --openmpi-pmix --gpu -v $PWD:$PWD -w $PWD nvcr.io/nvidia/pytorch:24.04-py3 \
$ALL_REDUCE_BIN -b 32KB -e 8GB -f 2 -g 4
srun: job 23916 queued and waiting for resources
srun: job 23916 has been allocated resources
Isambard-AI. Container adapted and running to run OpenMPI and NCCL on nid001031. https://docs.isambard.ac.uk/user-documentation/guides/containers/
Isambard-AI. Container adapted and running to run OpenMPI and NCCL on nid001032. https://docs.isambard.ac.uk/user-documentation/guides/containers/
# nThread 1 nGpus 4 minBytes 32768 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 30441 on nid001031 device 0 [0009:01:00] NVIDIA GH200 120GB
# Rank 1 Group 0 Pid 30441 on nid001031 device 1 [0019:01:00] NVIDIA GH200 120GB
# Rank 2 Group 0 Pid 30441 on nid001031 device 2 [0029:01:00] NVIDIA GH200 120GB
# Rank 3 Group 0 Pid 30441 on nid001031 device 3 [0039:01:00] NVIDIA GH200 120GB
# Rank 4 Group 0 Pid 153863 on nid001032 device 0 [0009:01:00] NVIDIA GH200 120GB
# Rank 5 Group 0 Pid 153863 on nid001032 device 1 [0019:01:00] NVIDIA GH200 120GB
# Rank 6 Group 0 Pid 153863 on nid001032 device 2 [0029:01:00] NVIDIA GH200 120GB
# Rank 7 Group 0 Pid 153863 on nid001032 device 3 [0039:01:00] NVIDIA GH200 120GB
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
32768 8192 float sum -1 43.93 0.75 1.31 0 39.52 0.83 1.45 0
65536 16384 float sum -1 141.5 0.46 0.81 0 141.1 0.46 0.81 0
131072 32768 float sum -1 153.2 0.86 1.50 0 162.4 0.81 1.41 0
262144 65536 float sum -1 313.4 0.84 1.46 0 322.1 0.81 1.42 0
524288 131072 float sum -1 628.2 0.83 1.46 0 1098.0 0.48 0.84 0
1048576 262144 float sum -1 893.7 1.17 2.05 0 910.1 1.15 2.02 0
2097152 524288 float sum -1 136.8 15.33 26.82 0 135.0 15.53 27.18 0
4194304 1048576 float sum -1 186.3 22.52 39.40 0 185.9 22.57 39.49 0
8388608 2097152 float sum -1 290.0 28.92 50.62 0 283.8 29.56 51.72 0
16777216 4194304 float sum -1 425.4 39.44 69.02 0 422.9 39.67 69.42 0
33554432 8388608 float sum -1 634.7 52.87 92.52 0 632.5 53.05 92.85 0
67108864 16777216 float sum -1 1080.9 62.09 108.66 0 1081.1 62.08 108.63 0
134217728 33554432 float sum -1 1853.2 72.42 126.74 0 1838.9 72.99 127.73 0
268435456 67108864 float sum -1 3321.7 80.81 141.42 0 3321.5 80.82 141.43 0
536870912 134217728 float sum -1 6248.9 85.91 150.35 0 6249.7 85.90 150.33 0
1073741824 268435456 float sum -1 12138 88.46 154.81 0 12144 88.42 154.73 0
2147483648 536870912 float sum -1 23883 89.92 157.36 0 23886 89.90 157.33 0
4294967296 1073741824 float sum -1 47399 90.61 158.57 0 47416 90.58 158.52 0
8589934592 2147483648 float sum -1 94425 90.97 159.20 0 94407 90.99 159.23 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 76.069
#
Under bus bandwidth (busbw) we reach 159.20 GB/s between 8 GPUs spread across 2 nodes.
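If you want to verify which transport NCCL selects between the two nodes, you can repeat the run with NCCL's debug output enabled. The sketch below assumes that `podman-hpc` passes standard podman options such as `-e` (set an environment variable in the container) through to podman:

$ srun -N 2 --gpus=8 --mpi=pmix --cpus-per-task 72 --ntasks-per-node 1 \
    podman-hpc run --openmpi-pmix --gpu -e NCCL_DEBUG=INFO -v $PWD:$PWD -w $PWD \
    nvcr.io/nvidia/pytorch:24.04-py3 \
    $ALL_REDUCE_BIN -b 32KB -e 8GB -f 2 -g 4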