- ✓ AIP1 Isambard-AI Phase 1 supported
- ✓ AIP2 Isambard-AI Phase 2 supported
- ✗ I3 Isambard 3 unsupported
- ✗ BC5 BlueCrystal 5 unsupported
NCCL
The NVIDIA Collective Communication Library (NCCL) provides direct GPU-GPU communication for NVIDIA GPUs. This guide shows how to use NCCL on Isambard-AI's HPE Slingshot 11 high-speed network.
NCCL provides RDMA (Remote Direct Memory Access) via GPUDirect. To use this on the Slingshot network, you must use the aws-ofi-nccl network plugin.
NCCL and aws-ofi-nccl are provided on the system as follows:
$ module load brics/nccl
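To confirm the module has set up the environment, you can inspect one of the variables it exports (the full list appears under Environment Variables below):
$ echo "$NCCL_NET"
AWS Libfabric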
The rest of this guide explains NCCL optimisations for Slingshot, since the best settings are application dependent.
PyTorch and NCCL
Older versions of PyTorch shipped with a statically linked NCCL in third_party/nccl. Since July 2025, PyTorch has depended on the system NCCL.
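If you are unsure which NCCL your PyTorch installation uses, you can query it directly (torch.cuda.nccl.version() returns the version as a tuple in recent releases):
# Print the NCCL version PyTorch is linked against
$ python -c 'import torch; print(torch.cuda.nccl.version())'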
Environment Variables
NCCL environment variables are used to optimise performance on the Slingshot network. The official NCCL documentation has in-depth explanations of each variable.
These variables are set in the module brics/nccl, but are explained here for application-specific optimisation or if you are building your own version.
# Use aws-ofi-nccl
export NCCL_NET="AWS Libfabric"
# Use the high-speed network interface
export NCCL_SOCKET_IFNAME="hsn"
# Print the NCCL version at startup
export NCCL_DEBUG="VERSION"
# Use GPUDirect RDMA when the GPU and NIC share a PCI host bridge (PHB).
export NCCL_NET_GDR_LEVEL="PHB"
# Allow rings/trees to use different NICs due to Slingshot topology.
export NCCL_CROSS_NIC="1"
# Use at least 4 channels per collective
export NCCL_MIN_NCHANNELS="4"
# Enable the GDRCopy library for low-latency GPU memory access
export NCCL_GDRCOPY_ENABLE="1"
# FI (libfabric) environment variables to optimise NCCL on Slingshot
export FI_CXI_DEFAULT_CQ_SIZE="131072"
export FI_CXI_DEFAULT_TX_SIZE="1024"
export FI_CXI_DISABLE_NON_INJECT_MSG_IDC="1"
export FI_HMEM_CUDA_USE_GDRCOPY="1"
# Setting the MR cache monitor and disabling host registration prevents NCCL hangs/deadlocks
export FI_CXI_DISABLE_HOST_REGISTER="1"
export FI_MR_CACHE_MONITOR="userfaultfd"
# Further optimisation with the alternative rendezvous protocol
export FI_CXI_RDZV_PROTO="alt_read"
export FI_CXI_RDZV_THRESHOLD="0"
export FI_CXI_RDZV_GET_MIN="0"
export FI_CXI_RDZV_EAGER_SIZE="0"
Some optimisations are application dependent
The default system settings optimise for a wide range of scenarios. However, some variables, namely NCCL_MIN_NCHANNELS, FI_CXI_DEFAULT_CQ_SIZE, and FI_CXI_DEFAULT_TX_SIZE, are application dependent and may need tuning for your workload.
For more information, see the official NCCL documentation on environment variables.
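As a purely illustrative sketch, a bandwidth-bound application running at scale might experiment with more channels and larger libfabric queues. The values below are not recommendations; benchmark your own workload:
# Illustrative values only: tune and benchmark for your application
export NCCL_MIN_NCHANNELS="16"
export FI_CXI_DEFAULT_CQ_SIZE="262144"
export FI_CXI_DEFAULT_TX_SIZE="2048"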
NCCL and AWS Libfabric
The environment variable NCCL_NET="AWS Libfabric" forces NCCL to use the aws-ofi-nccl plugin. If the plugin isn't set up in your environment, you will get the following error:
NCCL WARN Error: network AWS Libfabric not found.
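If you have built the plugin yourself (see Installing aws-ofi-nccl below), check that its library directory is on LD_LIBRARY_PATH, for example:
# Make the aws-ofi-nccl plugin (libnccl-net.so) visible to NCCL
$ export LD_LIBRARY_PATH=$(realpath aws-ofi-nccl/build/lib):$LD_LIBRARY_PATH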
Debugging
When debugging NCCL, set NCCL_DEBUG=INFO to enable the NCCL debug logs.
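For example (NCCL_DEBUG_SUBSYS is optional and filters the output to the listed subsystems):
$ export NCCL_DEBUG="INFO"
# Optionally restrict logging, e.g. to initialisation and networking
$ export NCCL_DEBUG_SUBSYS="INIT,NET"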
Installing and Benchmarking NCCL
Installing NCCL
If you need a specific version of NCCL, or you are building a custom container, you can build NCCL from source as follows:
$ module load cudatoolkit PrgEnv-gnu
$ git clone --branch "v2.27.7-1" https://github.com/NVIDIA/nccl.git
$ cd nccl
$ mkdir build
$ make -j 8 src.build PREFIX=$(realpath build)
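Assuming NCCL's default build layout, the library and headers should now be under build/:
# Verify the build output
$ ls build/lib/libnccl.so build/include/nccl.h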
Installing aws-ofi-nccl
The NCCL network plugin aws-ofi-nccl must be built for NCCL to use the Slingshot high-speed network. It depends on CUDA and Libfabric. Ensure you use a release whose tag carries the aws suffix (e.g. v1.7.x-aws). The network plugin can be installed as follows:
$ module load cudatoolkit PrgEnv-gnu
$ git clone --branch "v1.7.x-aws" https://github.com/aws/aws-ofi-nccl.git
$ cd aws-ofi-nccl
$ ./autogen.sh
$ export LIBFABRIC_HOME=/opt/cray/libfabric/1.22.0 CC=/usr/bin/gcc-12 CXX=/usr/bin/g++-12
$ mkdir build
$ ./configure --prefix=$(realpath build) --with-cuda=${CUDA_HOME} --with-libfabric=${LIBFABRIC_HOME} --disable-tests
$ make -j 8 install
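The plugin library should now be in the prefix passed to configure (libnccl-net.so is the name NCCL searches for on the library path; check your build if the name differs):
# Verify the plugin was installed
$ ls build/lib/libnccl-net.so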
Benchmarking NCCL with nccl-tests
nccl-tests depends on MPI, NCCL, and CUDA. We can build it as follows:
$ module load cudatoolkit PrgEnv-gnu
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ export MPI_HOME=/opt/cray/pe/mpich/8.1.32/ofi/gnu/12.3/
# Make sure you set NCCL_HOME to your build directory
$ export NCCL_HOME=$(realpath ../nccl/build)
$ make -j 8 MPI=1 MPI_HOME=${MPI_HOME} NCCL_HOME=${NCCL_HOME} CUDA_HOME=${CUDA_HOME}
We can now run an All Reduce benchmark to measure the network bandwidth for collectives. Running without the plugin, we see a maximum bus bandwidth (busbw) of 2.32 GB/s.
$ srun -N 2 --gpus 8 --network=disable_rdzv_get $(realpath nccl-tests/build/all_reduce_perf) -b 32KB -e 8GB -f 2 -g 4
# nccl-tests version 2.17.6 nccl-headers=21903 nccl-library=21903
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 4 minBytes 32768 maxBytes 8589934592 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 123253 on nid001024 device 0 [0009:01:00] NVIDIA GH200 120GB
[...]
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
[...]
8589934592 2147483648 float sum -1 6492425 1.32 2.32 0 6488797 1.32 2.32 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 1.3806
#
# Collective test concluded: all_reduce_perf
The above shows very slow performance, indicating that NCCL is not using RDMA and is instead falling back to the TCP socket interface. To enable RDMA, we will source the environment variables above and add the plugin to our environment:
$ export LD_LIBRARY_PATH=$(realpath aws-ofi-nccl/build/lib):$LD_LIBRARY_PATH
# Ensure you source the environment variables from above
$ srun -N 2 --gpus 8 --cpus-per-task 72 --network=disable_rdzv_get nccl-tests/build/all_reduce_perf -b 32KB -e 8GB -f 2 -g 4
# nccl-tests version 2.17.6 nccl-headers=21903 nccl-library=21903
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 4 minBytes 32768 maxBytes 8589934592 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 125488 on nid001024 device 0 [0009:01:00] NVIDIA GH200 120GB
# Rank 1 Group 0 Pid 125488 on nid001024 device 1 [0019:01:00] NVIDIA GH200 120GB
# Rank 2 Group 0 Pid 125488 on nid001024 device 2 [0029:01:00] NVIDIA GH200 120GB
# Rank 3 Group 0 Pid 125488 on nid001024 device 3 [0039:01:00] NVIDIA GH200 120GB
# Rank 4 Group 0 Pid 136269 on nid001025 device 0 [0009:01:00] NVIDIA GH200 120GB
# Rank 5 Group 0 Pid 136269 on nid001025 device 1 [0019:01:00] NVIDIA GH200 120GB
# Rank 6 Group 0 Pid 136269 on nid001025 device 2 [0029:01:00] NVIDIA GH200 120GB
# Rank 7 Group 0 Pid 136269 on nid001025 device 3 [0039:01:00] NVIDIA GH200 120GB
NCCL version 2.19.3+cuda12.3
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
[...]
8589934592 2147483648 float sum -1 92400.1 92.96 162.69 0 92381.3 92.98 162.72 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 62.3908
#
# Collective test concluded: all_reduce_perf
Here you can see that NCCL is correctly using RDMA on Slingshot, achieving an All Reduce out-of-place bus bandwidth of 162.69 GB/s, roughly 70 times the socket-based result.
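For reference, here is a minimal sketch of a batch script wrapping the benchmark. It assumes the system-provided brics/nccl module (which sets the environment variables above); the job name and resource values are illustrative and should be adapted to your project:
#!/bin/bash
#SBATCH --job-name=nccl-allreduce
#SBATCH --nodes=2
#SBATCH --gpus=8

# brics/nccl provides NCCL, aws-ofi-nccl, and the tuned environment variables
module load brics/nccl

srun --network=disable_rdzv_get \
    "$(realpath nccl-tests/build/all_reduce_perf)" -b 32KB -e 8GB -f 2 -g 4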