Podman-HPC Multi-node
podman-hpc has been set up with 2 extra flags:

--mpi-trial
    Adds OpenMPI and PMIx to the container.

--nccl-trial
    Adds NCCL, OpenMPI, and PMIx to the container.

These flags mount the required host dependencies into the container, and an entrypoint script sets up the environment. Below are examples that show how to use these flags in conjunction with nccl-tests (for NCCL / AI workloads).
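As a minimal sketch of how the flags are used (the image name is a placeholder, and the commands inside the container are only illustrative):

```shell
# Add the host's OpenMPI and PMIx to a container, then check the MPI version
podman-hpc run --rm --mpi-trial <CONTAINER_IMAGE> mpirun --version

# Add NCCL as well, plus GPU access, for NCCL-based workloads
podman-hpc run --rm --nccl-trial --gpu <CONTAINER_IMAGE> env
```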
Entrypoints

The podman-hpc flags modify the entrypoint of the container to execute /host/adapt.sh. If your container defines its own entrypoint, you will have to execute it manually after starting the container. If you would rather not run the adapt.sh entrypoint, you can set it empty: podman-hpc run --entrypoint= <CONTAINER_IMAGE>.
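For example, to run an image's original entrypoint by hand after the adapt.sh setup has completed (the entrypoint path below is hypothetical; check your image's Dockerfile ENTRYPOINT for the real one):

```shell
# Start the adapted container with an interactive shell
podman-hpc run -it --rm --nccl-trial <CONTAINER_IMAGE> bash
# ...then, inside the container, invoke the image's own entrypoint manually
# (example path only -- substitute your image's actual entrypoint script)
/opt/entrypoint.sh
```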
Using PMIx

The OpenMPI currently on the system is built only with PMIx support. When you execute jobs with containers you must use srun --mpi=pmix. Please see the MPI documentation.
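As an illustration (node and task counts, image name, and application binary are all placeholders), a containerised MPI application would be launched as:

```shell
# Slurm hands off process management via PMIx so the container's
# OpenMPI can wire up ranks across nodes (counts are illustrative)
srun -N 2 --ntasks-per-node 4 --mpi=pmix \
    podman-hpc run --mpi-trial <CONTAINER_IMAGE> ./my_mpi_app
```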
Example: Running nccl-tests in podman-hpc
Shown here is an example of how to build nccl-tests in an NGC container without rebuilding the container itself, using the host's MPI. We use the nccl-tests repository and the nvcr.io/nvidia/pytorch:24.04-py3 NGC container for access to its compiler suite. First nccl-tests is cloned, and the current working directory is mounted in the container so the build persists.
$ podman-hpc pull nvcr.io/nvidia/pytorch:24.04-py3
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ podman-hpc run -it --rm --nccl-trial -v $PWD:$PWD -w $PWD nvcr.io/nvidia/pytorch:24.04-py3 bash
This container has been adapted to run MPI and NCCL applications on Isambard-AI.
The LD_LIBRARY_PATH and PATH have been set to include the OpenMPI and NCCL libraries.
Please read the relevant documentation for using containers on multi-node jobs:
https://docs.isambard.ac.uk/user-documentation/guides/containers-multi-node/
user.project@nid001040:~$ make -j 32 MPI=1 MPI_HOME=/host/ompi NCCL_HOME=/host/nccl CUDA_HOME=${CUDA_HOME}
[build output...]
user.project@nid001040:~$ exit
In the execution of the container with podman-hpc run, note the use of the following flags:

-it
    Run interactively.

--rm
    Remove the container after exiting.

--nccl-trial
    Add NCCL, OpenMPI, and PMIx to the container.

-v $PWD:$PWD
    Mount the current working directory at the same path.

-w $PWD
    Set the same working directory inside the container.

bash
    This has to be added for the entrypoint script to drop into an interactive shell.
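The same build can also be scripted non-interactively by replacing bash with the make command itself (a sketch, assuming the same clone and image as above):

```shell
# Non-interactive variant of the build above: run make directly
# instead of dropping into an interactive shell
cd nccl-tests
podman-hpc run --rm --nccl-trial -v $PWD:$PWD -w $PWD \
    nvcr.io/nvidia/pytorch:24.04-py3 \
    bash -c 'make -j 32 MPI=1 MPI_HOME=/host/ompi NCCL_HOME=/host/nccl CUDA_HOME=${CUDA_HOME}'
```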
After exiting the container we can now use srun to dispatch our job. To execute the benchmarks we must add --mpi=pmix for Slurm to interface using PMIx, and the --nccl-trial flag to podman-hpc:
$ cd nccl-tests
$ export ALL_REDUCE_BIN=$PWD/build/all_reduce_perf
$ srun -N 2 --gpus=8 --mpi=pmix --cpus-per-task 72 --ntasks-per-node 1 \
podman-hpc run --nccl-trial --gpu -v $PWD:$PWD -w $PWD nvcr.io/nvidia/pytorch:24.04-py3 \
$ALL_REDUCE_BIN -b 32KB -e 8GB -f 2 -g 4
srun: job 23916 queued and waiting for resources
srun: job 23916 has been allocated resources
This container has been adapted to run MPI and NCCL applications on Isambard-AI.
The LD_LIBRARY_PATH and PATH have been set to include the OpenMPI and NCCL libraries.
Please read the relevant documentation for using containers on multi-node jobs:
https://docs.isambard.ac.uk/user-documentation/guides/containers-multi-node/
# nThread 1 nGpus 4 minBytes 32768 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 242178 on nid001007 device 0 [0x01] GH200 120GB
# Rank 1 Group 0 Pid 242178 on nid001007 device 1 [0x01] GH200 120GB
# Rank 2 Group 0 Pid 242178 on nid001007 device 2 [0x01] GH200 120GB
# Rank 3 Group 0 Pid 242178 on nid001007 device 3 [0x01] GH200 120GB
# Rank 4 Group 0 Pid 77688 on nid001008 device 0 [0x01] GH200 120GB
# Rank 5 Group 0 Pid 77688 on nid001008 device 1 [0x01] GH200 120GB
# Rank 6 Group 0 Pid 77688 on nid001008 device 2 [0x01] GH200 120GB
# Rank 7 Group 0 Pid 77688 on nid001008 device 3 [0x01] GH200 120GB
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
32768 8192 float sum -1 47.55 0.69 1.21 0 42.83 0.77 1.34 0
65536 16384 float sum -1 58.63 1.12 1.96 0 57.88 1.13 1.98 0
131072 32768 float sum -1 80.02 1.64 2.87 0 57.96 2.26 3.96 0
262144 65536 float sum -1 326.2 0.80 1.41 0 322.6 0.81 1.42 0
524288 131072 float sum -1 1666.9 0.31 0.55 0 1685.7 0.31 0.54 0
1048576 262144 float sum -1 6373.3 0.16 0.29 0 6020.4 0.17 0.30 0
2097152 524288 float sum -1 1004.1 2.09 3.65 0 1009.2 2.08 3.64 0
4194304 1048576 float sum -1 1013.7 4.14 7.24 0 1015.6 4.13 7.23 0
8388608 2097152 float sum -1 1347.2 6.23 10.90 0 1351.1 6.21 10.87 0
16777216 4194304 float sum -1 1368.8 12.26 21.45 0 1364.8 12.29 21.51 0
33554432 8388608 float sum -1 1420.4 23.62 41.34 0 1409.5 23.81 41.66 0
67108864 16777216 float sum -1 1057.0 63.49 111.11 0 1051.7 63.81 111.67 0
134217728 33554432 float sum -1 1820.7 73.72 129.00 0 1812.2 74.06 129.61 0
268435456 67108864 float sum -1 3298.4 81.38 142.42 0 3293.8 81.50 142.62 0
536870912 134217728 float sum -1 6214.5 86.39 151.18 0 6212.5 86.42 151.23 0
1073741824 268435456 float sum -1 12055 89.07 155.87 0 12051 89.10 155.93 0
2147483648 536870912 float sum -1 23709 90.58 158.51 0 23727 90.51 158.39 0
4294967296 1073741824 float sum -1 47087 91.21 159.62 0 47080 91.23 159.65 0
8589934592 2147483648 float sum -1 93892 91.49 160.10 0 93863 91.52 160.15 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 66.4308
#
Under bus bandwidth (busbw) we achieve up to 160.10 GB/s across the 8 GPUs spread over 2 nodes.
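For repeated runs, the srun invocation above can be wrapped in a batch script (a sketch only; the job name, resource values, and working directory are placeholders to adjust for your project):

```shell
#!/bin/bash
#SBATCH --job-name=nccl-tests   # placeholder values: adjust for your project
#SBATCH --nodes=2
#SBATCH --gpus=8
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=72

# Assumes the nccl-tests clone and build from the steps above
cd $HOME/nccl-tests
ALL_REDUCE_BIN=$PWD/build/all_reduce_perf

srun --mpi=pmix \
    podman-hpc run --nccl-trial --gpu -v $PWD:$PWD -w $PWD \
    nvcr.io/nvidia/pytorch:24.04-py3 \
    $ALL_REDUCE_BIN -b 32KB -e 8GB -f 2 -g 4
```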