  • AIP1 Isambard-AI Phase 1 supported
  • AIP2 Isambard-AI Phase 2 supported
  • I3 Isambard 3 unsupported
  • BC5 BlueCrystal 5 unsupported

NCCL

The NVIDIA Collective Communication Library (NCCL) provides direct GPU-to-GPU communication for NVIDIA GPUs. This guide shows how to use NCCL on Isambard-AI's HPE Slingshot 11 high-speed network.

NCCL provides RDMA (Remote Direct Memory Access) via GPUDirect. To use this on the Slingshot network, you must use the aws-ofi-nccl network plugin.

NCCL and aws-ofi-nccl are provided on the system as follows:

$ module load brics/nccl
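
To confirm the module has configured your environment, you can check that the variables it exports (listed in the next section) are set; for example:

$ echo "$NCCL_NET"
AWS Libfabric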

The rest of this guide explains the optimisations applied to NCCL on Slingshot, since they are application dependent.

PyTorch and NCCL

Older versions of PyTorch shipped with a statically linked NCCL built from third_party/nccl. Since July 2025, PyTorch depends on the system NCCL.
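
To see which NCCL your PyTorch build links against, you can query it from Python. This is a quick sanity check rather than part of the system setup:

# Print the NCCL version PyTorch was built against, and whether the NCCL backend is available
$ python -c "import torch; print(torch.cuda.nccl.version())"
$ python -c "import torch.distributed as dist; print(dist.is_nccl_available())"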

Environment Variables

NCCL environment variables are used to optimise performance on the Slingshot network. The official NCCL documentation has in-depth explanations of these variables.

These variables are set by the brics/nccl module, but are explained here for application-specific optimisation or in case you are building your own version.

# Use aws-ofi-nccl
export NCCL_NET="AWS Libfabric"
# Use the high speed network interface
export NCCL_SOCKET_IFNAME="hsn"
# Print the NCCL version at startup
export NCCL_DEBUG="VERSION"
# Use GPUDirect RDMA when the GPU and NIC are on the same NUMA node.
export NCCL_NET_GDR_LEVEL="PHB"
# Allow rings/trees to use different NICs due to Slingshot topology.
export NCCL_CROSS_NIC="1"
# Use at least 4 channels (application dependent).
export NCCL_MIN_NCHANNELS="4"
# Enable GDRCopy for low-latency transfers between host and GPU memory.
export NCCL_GDRCOPY_ENABLE="1"

# FI (libfabric) environment variables to optimise NCCL on Slingshot
export FI_CXI_DEFAULT_CQ_SIZE="131072"
export FI_CXI_DEFAULT_TX_SIZE="1024"
export FI_CXI_DISABLE_NON_INJECT_MSG_IDC="1"
export FI_HMEM_CUDA_USE_GDRCOPY="1"
# Setting the cache monitor and host register prevents NCCL hangs / deadlocks
export FI_CXI_DISABLE_HOST_REGISTER="1"
export FI_MR_CACHE_MONITOR="userfaultfd"
# Further optimisation with the alternative rendezvous protocol
export FI_CXI_RDZV_PROTO="alt_read"
export FI_CXI_RDZV_THRESHOLD="0"
export FI_CXI_RDZV_GET_MIN="0"
export FI_CXI_RDZV_EAGER_SIZE="0"

Some optimisations are application dependent

The default system settings optimise for a wide range of scenarios. However, some of the settings are application dependent, in particular NCCL_MIN_NCHANNELS, FI_CXI_DEFAULT_CQ_SIZE, and FI_CXI_DEFAULT_TX_SIZE.

For more information, see the official NCCL documentation on environment variables.
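
If profiling shows that your application benefits from different values, you can override the module's defaults after loading it. The values below are purely illustrative starting points, not recommendations:

$ module load brics/nccl
# Illustrative overrides only; tune per application and measure with nccl-tests
$ export NCCL_MIN_NCHANNELS="16"
$ export FI_CXI_DEFAULT_CQ_SIZE="262144"
$ export FI_CXI_DEFAULT_TX_SIZE="2048"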

NCCL and AWS Libfabric

The environment variable NCCL_NET="AWS Libfabric" forces NCCL to use the aws-ofi-nccl plugin. If the plugin isn't set up in your environment, you will get the following error:

NCCL WARN Error: network AWS Libfabric not found.
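
One way to confirm the plugin is being picked up, rather than waiting for this error, is to run a small job with NCCL_DEBUG=INFO and look for the selected network in the startup output. The exact wording of the log line varies between NCCL versions, and the command below assumes the nccl-tests binary built later in this guide:

# Expect a startup line mentioning the "AWS Libfabric" network if the plugin is in use
$ NCCL_DEBUG=INFO srun -N 2 --gpus 8 --network=disable_rdzv_get nccl-tests/build/all_reduce_perf -b 1M -e 1M -g 4 2>&1 | grep -i "libfabric"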

Debugging

When debugging NCCL, set NCCL_DEBUG=INFO to enable the NCCL debug logs.
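
For example, to focus the logs on initialisation and network selection, and to enable libfabric logging as well (NCCL_DEBUG_SUBSYS and FI_LOG_LEVEL are standard NCCL and libfabric variables; the values here are just a starting point):

# Verbose NCCL logs, restricted to the init and network subsystems
$ export NCCL_DEBUG=INFO
$ export NCCL_DEBUG_SUBSYS=INIT,NET
# Optional: warnings from libfabric (including the CXI provider)
$ export FI_LOG_LEVEL=warn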

Installing and Benchmarking NCCL

Installing NCCL

If you need a specific version of NCCL, or you are building a custom container, you can build NCCL from source with the following instructions:

$ module load cudatoolkit PrgEnv-gnu
$ git clone --branch "v2.27.7-1" https://github.com/NVIDIA/nccl.git
$ cd nccl
$ mkdir build
$ make -j 8 src.build PREFIX=$(realpath build)
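
The build lands in build/, with headers under build/include and libraries under build/lib. To use this build in the steps below, point NCCL_HOME and LD_LIBRARY_PATH at it:

$ export NCCL_HOME=$(realpath build)
$ export LD_LIBRARY_PATH=${NCCL_HOME}/lib:${LD_LIBRARY_PATH}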

Installing aws-ofi-nccl

The NCCL network plugin aws-ofi-nccl must be built for NCCL to use the Slingshot high-speed network. It depends on CUDA and Libfabric. Ensure you use a version tagged with aws. The network plugin can be installed as follows:

$ module load cudatoolkit PrgEnv-gnu
$ git clone --branch "v1.7.x-aws" https://github.com/aws/aws-ofi-nccl.git
$ cd aws-ofi-nccl
$ ./autogen.sh 
$ export LIBFABRIC_HOME=/opt/cray/libfabric/1.22.0 CC=/usr/bin/gcc-12 CXX=/usr/bin/g++-12
$ mkdir build
$ ./configure --prefix=$(realpath build) --with-cuda=${CUDA_HOME} --with-libfabric=${LIBFABRIC_HOME} --disable-tests
$ make -j 8 install
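
The plugin's shared library is installed into build/lib and must be on LD_LIBRARY_PATH at run time for NCCL to load it, as shown in the benchmarking section below. A quick check that the install produced it (the exact library name can vary between plugin versions):

$ ls build/lib/libnccl-net*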

Benchmarking NCCL with nccl-tests

nccl-tests depends on MPI, NCCL, and CUDA. We can build it with the following instructions:

$ module load cudatoolkit PrgEnv-gnu
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ export MPI_HOME=/opt/cray/pe/mpich/8.1.32/ofi/gnu/12.3/ 
# Make sure you set NCCL_HOME to your build directory
$ export NCCL_HOME=$(realpath ../nccl/build)
$ make -j 8 MPI=1 MPI_HOME=${MPI_HOME} NCCL_HOME=${NCCL_HOME} CUDA_HOME=${CUDA_HOME}

We can now run an All Reduce benchmark to measure the network bandwidth for collectives. Running without the plugin, we see a maximum bus bandwidth (busbw) of only 2.32 GB/s.

$ srun -N 2 --gpus 8 --network=disable_rdzv_get $(realpath nccl-tests/build/all_reduce_perf) -b 32KB -e 8GB -f 2 -g 4
# nccl-tests version 2.17.6 nccl-headers=21903 nccl-library=21903
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 4 minBytes 8 maxBytes 4294967296 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 123253 on  nid001024 device  0 [0009:01:00] NVIDIA GH200 120GB
[...]
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong 
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)         
[...]
  8589934592    2147483648     float     sum      -1  6492425    1.32    2.32       0  6488797    1.32    2.32       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.3806
#
# Collective test concluded: all_reduce_perf

The above shows very slow performance, indicating that NCCL is not using RDMA and is instead falling back to a TCP socket interface. To enable RDMA, source the environment variables above and add the plugin to your environment:

$ export LD_LIBRARY_PATH=$(realpath aws-ofi-nccl/build/lib):$LD_LIBRARY_PATH
# Ensure you source the environment variables from above
$ srun -N 2 --gpus 8 --cpus-per-task 72 --network=disable_rdzv_get nccl-tests/build/all_reduce_perf -b 32KB -e 8GB -f 2 -g 4
# nccl-tests version 2.17.6 nccl-headers=21903 nccl-library=21903
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 4 minBytes 32768 maxBytes 8589934592 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 125488 on  nid001024 device  0 [0009:01:00] NVIDIA GH200 120GB
#  Rank  1 Group  0 Pid 125488 on  nid001024 device  1 [0019:01:00] NVIDIA GH200 120GB
#  Rank  2 Group  0 Pid 125488 on  nid001024 device  2 [0029:01:00] NVIDIA GH200 120GB
#  Rank  3 Group  0 Pid 125488 on  nid001024 device  3 [0039:01:00] NVIDIA GH200 120GB
#  Rank  4 Group  0 Pid 136269 on  nid001025 device  0 [0009:01:00] NVIDIA GH200 120GB
#  Rank  5 Group  0 Pid 136269 on  nid001025 device  1 [0019:01:00] NVIDIA GH200 120GB
#  Rank  6 Group  0 Pid 136269 on  nid001025 device  2 [0029:01:00] NVIDIA GH200 120GB
#  Rank  7 Group  0 Pid 136269 on  nid001025 device  3 [0039:01:00] NVIDIA GH200 120GB
NCCL version 2.19.3+cuda12.3
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong 
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)         
[...]
  8589934592    2147483648     float     sum      -1  92400.1   92.96  162.69       0  92381.3   92.98  162.72       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 62.3908 
#
# Collective test concluded: all_reduce_perf

Here you can see that NCCL is correctly using RDMA on Slingshot, achieving an All Reduce out-of-place bus bandwidth of 162.69 GB/s.

Resources