- ✓ AIP1 Isambard-AI Phase 1 supported
- ✓ AIP2 Isambard-AI Phase 2 supported
- ✗ I3 Isambard 3 unsupported
- ✗ BC5 BlueCrystal 5 unsupported
NCCL
The NVIDIA Collective Communication Library (NCCL) provides direct GPU-to-GPU communication for NVIDIA GPUs. This guide shows how to use NCCL on Isambard-AI's HPE Slingshot 11 high-speed network.
NCCL provides RDMA (Remote Direct Memory Access) via GPUDirect. To use this on the Slingshot network, you must use the aws-ofi-nccl network plugin.
NCCL and aws-ofi-nccl are provided on the system as follows:
$ module load brics/nccl
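You can inspect exactly which environment variables the module sets with:
$ module show brics/nccl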
The rest of this guide explains NCCL optimisations on Slingshot, many of which are application dependent.
Environment Variables¶
NCCL environment variables are used to optimise performance on the Slingshot network. The official NCCL documentation has in-depth explanations of these environment variables.
These variables are set by the brics/nccl module, but are explained here for application-specific optimisation or in case you are building your own version.
# Use aws-ofi-nccl
export NCCL_NET="AWS Libfabric"
# Use the high speed network interface
export NCCL_SOCKET_IFNAME="hsn"
# Print the NCCL version at startup
export NCCL_DEBUG="VERSION"
# Use GPUDirect RDMA when the GPU and NIC are on the same NUMA node.
export NCCL_NET_GDR_LEVEL="PHB"
# Allow rings/trees to use different NICs due to Slingshot topology.
export NCCL_CROSS_NIC="1"
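# Use at least 4 channels (application dependent, see note below).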
export NCCL_MIN_NCHANNELS="4"
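# Enable GDRCopy support.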
export NCCL_GDRCOPY_ENABLE="1"
# FI (libfabric) environment variables to optimise NCCL on Slingshot
export FI_CXI_DEFAULT_CQ_SIZE="131072"
export FI_CXI_DEFAULT_TX_SIZE="1024"
export FI_CXI_DISABLE_NON_INJECT_MSG_IDC="1"
export FI_HMEM_CUDA_USE_GDRCOPY="1"
# Disabling host registration and setting the MR cache monitor prevents NCCL hangs / deadlocks
export FI_CXI_DISABLE_HOST_REGISTER="1"
export FI_MR_CACHE_MONITOR="userfaultfd"
# Further optimisation with the alternative rendezvous protocol
export FI_CXI_RDZV_PROTO="alt_read"
export FI_CXI_RDZV_THRESHOLD="0"
export FI_CXI_RDZV_GET_MIN="0"
export FI_CXI_RDZV_EAGER_SIZE="0"
Some optimisations are application dependent
The default system settings optimise for a wide range of scenarios. However, some optimisations are application dependent, in particular NCCL_MIN_NCHANNELS, FI_CXI_DEFAULT_CQ_SIZE, and FI_CXI_DEFAULT_TX_SIZE.
For more information see the official NCCL documentation on environment variables.
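As an illustrative example, an application that benefits from more channels or larger libfabric queues could override these after loading the module (the values below are assumptions for illustration, not recommendations):
module load brics/nccl
# Illustrative values only -- tune for your application
export NCCL_MIN_NCHANNELS="16"
export FI_CXI_DEFAULT_CQ_SIZE="262144"
export FI_CXI_DEFAULT_TX_SIZE="2048"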
NCCL and AWS Libfabric
The environment variable NCCL_NET="AWS Libfabric" forces NCCL to use the aws-ofi-nccl plugin. If the plugin isn't set up in your environment, you will get the following error:
NCCL WARN Error: network AWS Libfabric not found.
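If you see this error, check that the plugin library is visible on your LD_LIBRARY_PATH. A quick check, assuming the plugin is installed under the usual name libnccl-net.so (as in the build instructions below):
$ for dir in ${LD_LIBRARY_PATH//:/ }; do ls "$dir"/libnccl-net* 2>/dev/null; done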
Debugging
When debugging NCCL, set NCCL_DEBUG=INFO to get the NCCL debug logs.
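For example, to focus the logs on initialisation and network selection you can additionally set NCCL_DEBUG_SUBSYS:
export NCCL_DEBUG="INFO"
export NCCL_DEBUG_SUBSYS="INIT,NET"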
Installing and Benchmarking NCCL¶
Installing nccl¶
If you need a specific version of NCCL, or you are building a custom container, you can build NCCL with the following instructions:
$ module load cudatoolkit PrgEnv-gnu
$ git clone --branch "v2.27.7-1" https://github.com/NVIDIA/nccl.git
$ cd nccl
$ mkdir build
$ make -j 8 src.build PREFIX=$(realpath build)
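To use this build ahead of the system NCCL, you can prepend it to your library path (assuming the default build layout with libraries under build/lib):
$ export NCCL_HOME=$(realpath build)
$ export LD_LIBRARY_PATH=${NCCL_HOME}/lib:$LD_LIBRARY_PATH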
Installing aws-ofi-nccl¶
The NCCL network plugin aws-ofi-nccl must be built for NCCL to use the Slingshot high-speed network. It depends on CUDA and Libfabric. Ensure you use a version tagged with the aws suffix (for example, v1.7.x-aws as used below). The network plugin can be installed as follows:
$ module load cudatoolkit PrgEnv-gnu
$ git clone --branch "v1.7.x-aws" https://github.com/aws/aws-ofi-nccl.git
$ cd aws-ofi-nccl
$ ./autogen.sh
$ export LIBFABRIC_HOME=/opt/cray/libfabric/1.22.0 CC=/usr/bin/gcc-12 CXX=/usr/bin/g++-12
$ mkdir build
$ ./configure --prefix=$(realpath build) --with-cuda=${CUDA_HOME} --with-libfabric=${LIBFABRIC_HOME} --disable-tests
$ make -j 8 install
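The plugin library should now be installed under build/lib (usually as libnccl-net.so, though the exact name depends on the aws-ofi-nccl version); a quick check:
$ ls build/lib/libnccl-net*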
Benchmarking NCCL with nccl-tests¶
nccl-tests depends on MPI, NCCL, and CUDA. We can build it with the following instructions:
$ module load cudatoolkit PrgEnv-gnu
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ export MPI_HOME=/opt/cray/pe/mpich/8.1.32/ofi/gnu/12.3/
# Make sure you set NCCL_HOME to your build directory
$ export NCCL_HOME=$(realpath ../nccl/build)
$ make -j 8 MPI=1 MPI_HOME=${MPI_HOME} NCCL_HOME=${NCCL_HOME} CUDA_HOME=${CUDA_HOME}
We can now run an All Reduce benchmark to measure the network bandwidth for collectives. Running without the plugin, we see a maximum bus bandwidth (busbw) of 2.32 GB/s.
$ srun -N 2 --gpus 8 --network=disable_rdzv_get $(realpath nccl-tests/build/all_reduce_perf) -b 32KB -e 8GB -f 2 -g 4
# nccl-tests version 2.17.6 nccl-headers=21903 nccl-library=21903
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 4 minBytes 8 maxBytes 4294967296 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 123253 on nid001024 device 0 [0009:01:00] NVIDIA GH200 120GB
[...]
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
[...]
8589934592 2147483648 float sum -1 6492425 1.32 2.32 0 6488797 1.32 2.32 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 1.3806
#
# Collective test concluded: all_reduce_perf
The above shows very slow performance, indicating that NCCL is not using RDMA and is instead falling back to the TCP socket interface. To enable RDMA, we will source the environment variables above and add the plugin to our environment:
$ export LD_LIBRARY_PATH=$(realpath aws-ofi-nccl/build/lib):$LD_LIBRARY_PATH
# Ensure you source the environment variables from above
$ srun -N 2 --gpus 8 --cpus-per-task 72 --network=disable_rdzv_get nccl-tests/build/all_reduce_perf -b 32KB -e 8GB -f 2 -g 4
# nccl-tests version 2.17.6 nccl-headers=21903 nccl-library=21903
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 4 minBytes 32768 maxBytes 8589934592 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 125488 on nid001024 device 0 [0009:01:00] NVIDIA GH200 120GB
# Rank 1 Group 0 Pid 125488 on nid001024 device 1 [0019:01:00] NVIDIA GH200 120GB
# Rank 2 Group 0 Pid 125488 on nid001024 device 2 [0029:01:00] NVIDIA GH200 120GB
# Rank 3 Group 0 Pid 125488 on nid001024 device 3 [0039:01:00] NVIDIA GH200 120GB
# Rank 4 Group 0 Pid 136269 on nid001025 device 0 [0009:01:00] NVIDIA GH200 120GB
# Rank 5 Group 0 Pid 136269 on nid001025 device 1 [0019:01:00] NVIDIA GH200 120GB
# Rank 6 Group 0 Pid 136269 on nid001025 device 2 [0029:01:00] NVIDIA GH200 120GB
# Rank 7 Group 0 Pid 136269 on nid001025 device 3 [0039:01:00] NVIDIA GH200 120GB
NCCL version 2.19.3+cuda12.3
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
[...]
8589934592 2147483648 float sum -1 92400.1 92.96 162.69 0 92381.3 92.98 162.72 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 62.3908
#
# Collective test concluded: all_reduce_perf
Here you can see that NCCL is correctly using RDMA on Slingshot, achieving an All Reduce out-of-place bus bandwidth of 162.69 GB/s.
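To apply these settings in your own jobs, a minimal Slurm batch sketch might look like the following, where ./my_nccl_app is a placeholder for your application and the node/GPU layout mirrors the benchmark above:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus=8
#SBATCH --cpus-per-task=72

# Loads NCCL, aws-ofi-nccl, and the environment variables described above
module load brics/nccl

# disable_rdzv_get matches the srun flag used in the benchmarks above
srun --network=disable_rdzv_get ./my_nccl_app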