  • Isambard-AI Phase 1 (AIP1): supported
  • Isambard-AI Phase 2 (AIP2): supported
  • Isambard 3 (I3): unsupported
  • BlueCrystal 5 (BC5): unsupported

NCCL

The NVIDIA Collective Communication Library (NCCL) provides direct GPU-GPU communication for NVIDIA GPUs. This guide shows how NCCL can be used on Isambard-AI's HPE Slingshot 11 high-speed network.

NCCL provides RDMA (Remote Direct Memory Access) via GPUDirect. To use this on the Slingshot network, you must use the aws-ofi-nccl network plugin.

NCCL and aws-ofi-nccl are provided on the system as follows:

$ module load brics/nccl
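
You can inspect the module to confirm which environment variables it sets, and check your environment after loading it; the exact variables listed will depend on the module version:

$ module show brics/nccl
$ module load brics/nccl
# List the NCCL and libfabric (FI_*) variables now set in your environment
$ env | grep -E '^(NCCL|FI)_' | sort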

The rest of this guide explains how NCCL is optimised for Slingshot, since some of the optimisations are application dependent.

Environment Variables

NCCL environment variables are used to optimise performance on the Slingshot network. The official NCCL documentation has in-depth explanations of these variables.

These variables are set by the module brics/nccl, but are explained here to support application-specific tuning or in case you are building your own version.

# Use aws-ofi-nccl
export NCCL_NET="AWS Libfabric"
# Use the high speed network interface
export NCCL_SOCKET_IFNAME="hsn"
# Print the NCCL version at startup
export NCCL_DEBUG="VERSION"
# Use GPUDirect RDMA when the GPU and NIC share the same PCI host bridge (NUMA node).
export NCCL_NET_GDR_LEVEL="PHB"
# Allow rings/trees to use different NICs due to Slingshot topology.
export NCCL_CROSS_NIC="1"
# Use at least four communication channels per communicator.
export NCCL_MIN_NCHANNELS="4"
# Use the GDRCopy library for low-latency access to GPU memory.
export NCCL_GDRCOPY_ENABLE="1"

# FI (libfabric) environment variables to optimise NCCL on Slingshot
export FI_CXI_DEFAULT_CQ_SIZE="131072"
export FI_CXI_DEFAULT_TX_SIZE="1024"
export FI_CXI_DISABLE_NON_INJECT_MSG_IDC="1"
export FI_HMEM_CUDA_USE_GDRCOPY="1"
# Setting the cache monitor and host register prevents NCCL hangs / deadlocks
export FI_CXI_DISABLE_HOST_REGISTER="1"
export FI_MR_CACHE_MONITOR="userfaultfd"
# Further optimisation with the alternative rendezvous protocol
export FI_CXI_RDZV_PROTO="alt_read"
export FI_CXI_RDZV_THRESHOLD="0"
export FI_CXI_RDZV_GET_MIN="0"
export FI_CXI_RDZV_EAGER_SIZE="0"

Some optimisations are application dependent

The default system settings optimise for a wide range of scenarios. However, it is important to note that some of these optimisations are application dependent, namely NCCL_MIN_NCHANNELS, FI_CXI_DEFAULT_CQ_SIZE, and FI_CXI_DEFAULT_TX_SIZE.

For more information see the official NCCL documentation on environment variables.
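
As an illustration only, a bandwidth-bound application might raise the channel count and libfabric queue sizes after loading the module. The values below are hypothetical, not tuned recommendations; profile your own application before changing them.

# Hypothetical overrides for a bandwidth-bound workload (values are illustrative)
export NCCL_MIN_NCHANNELS="16"
export FI_CXI_DEFAULT_CQ_SIZE="262144"
export FI_CXI_DEFAULT_TX_SIZE="2048"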

NCCL and AWS Libfabric

The environment variable NCCL_NET="AWS Libfabric" will force NCCL to use the aws-ofi-nccl plugin. If the plugin isn't set up in your environment, you will get the following error:

NCCL WARN Error: network AWS Libfabric not found.
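
A quick way to check whether the plugin is visible is to look for its shared library on the loader path. This is a minimal sketch; it assumes the plugin is installed under its usual name (libnccl-net*.so), which may vary between aws-ofi-nccl versions.

# Search every directory on LD_LIBRARY_PATH for the aws-ofi-nccl plugin library
$ find $(echo "$LD_LIBRARY_PATH" | tr ':' ' ') -maxdepth 1 -name 'libnccl-net*' 2>/dev/null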

Debugging

When debugging NCCL, ensure you set NCCL_DEBUG=INFO to get the NCCL debug logs.
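
For example, to focus on initialisation and network-transport messages and write one log file per rank (NCCL_DEBUG_SUBSYS and NCCL_DEBUG_FILE are standard NCCL variables; see the NCCL documentation for the full list of subsystems):

export NCCL_DEBUG="INFO"
# Limit the INFO output to the initialisation and network subsystems
export NCCL_DEBUG_SUBSYS="INIT,NET"
# Write one debug log per host/process instead of mixing it into stdout
export NCCL_DEBUG_FILE="nccl-debug.%h.%p.log"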

Installing and Benchmarking NCCL

Installing NCCL

If you need a specific version of NCCL, or you are trying to build a custom container, you can build NCCL from source as follows:

$ module load cudatoolkit PrgEnv-gnu
$ git clone --branch "v2.27.7-1" https://github.com/NVIDIA/nccl.git
$ cd nccl
$ mkdir build
$ make -j 8 src.build PREFIX=$(realpath build)

Installing aws-ofi-nccl

The NCCL network plugin aws-ofi-nccl must be built for NCCL to use the Slingshot high speed network. It depends on CUDA and Libfabric. Ensure you use a version that is tagged with aws. The network plugin can be installed as follows:

$ module load cudatoolkit PrgEnv-gnu
$ git clone --branch "v1.7.x-aws" https://github.com/aws/aws-ofi-nccl.git
$ cd aws-ofi-nccl
$ ./autogen.sh 
$ export LIBFABRIC_HOME=/opt/cray/libfabric/1.22.0 CC=/usr/bin/gcc-12 CXX=/usr/bin/g++-12
$ mkdir build
$ ./configure --prefix=$(realpath build) --with-cuda=${CUDA_HOME} --with-libfabric=${LIBFABRIC_HOME} --disable-tests
$ make -j 8 install

Benchmarking NCCL with nccl-tests

nccl-tests depends on MPI, NCCL, and CUDA. We can build it with the following instructions:

$ module load cudatoolkit PrgEnv-gnu
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ export MPI_HOME=/opt/cray/pe/mpich/8.1.32/ofi/gnu/12.3/ 
# Make sure you set NCCL_HOME to your build directory
$ export NCCL_HOME=$(realpath ../nccl/build)
$ make -j 8 MPI=1 MPI_HOME=${MPI_HOME} NCCL_HOME=${NCCL_HOME} CUDA_HOME=${CUDA_HOME}

We can now run an all-reduce benchmark to measure the network bandwidth for collectives. Running without the plugin, we see a maximum bus bandwidth (busbw) of only 2.32 GB/s.

$ srun -N 2 --gpus 8 --network=disable_rdzv_get $(realpath nccl-tests/build/all_reduce_perf) -b 32KB -e 8GB -f 2 -g 4
# nccl-tests version 2.17.6 nccl-headers=21903 nccl-library=21903
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 4 minBytes 8 maxBytes 4294967296 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 123253 on  nid001024 device  0 [0009:01:00] NVIDIA GH200 120GB
[...]
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong 
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)         
[...]
  8589934592    2147483648     float     sum      -1  6492425    1.32    2.32       0  6488797    1.32    2.32       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.3806
#
# Collective test concluded: all_reduce_perf

The above shows very slow performance, indicating that NCCL is not using RDMA and is instead falling back to the TCP socket interface. To enable RDMA, set the environment variables above and add the plugin to your environment:

$ export LD_LIBRARY_PATH=$(realpath aws-ofi-nccl/build/lib):$LD_LIBRARY_PATH
# Ensure you source the environment variables from above
$ srun -N 2 --gpus 8 --cpus-per-task 72 --network=disable_rdzv_get nccl-tests/build/all_reduce_perf -b 32KB -e 8GB -f 2 -g 4
# nccl-tests version 2.17.6 nccl-headers=21903 nccl-library=21903
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 4 minBytes 32768 maxBytes 8589934592 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 125488 on  nid001024 device  0 [0009:01:00] NVIDIA GH200 120GB
#  Rank  1 Group  0 Pid 125488 on  nid001024 device  1 [0019:01:00] NVIDIA GH200 120GB
#  Rank  2 Group  0 Pid 125488 on  nid001024 device  2 [0029:01:00] NVIDIA GH200 120GB
#  Rank  3 Group  0 Pid 125488 on  nid001024 device  3 [0039:01:00] NVIDIA GH200 120GB
#  Rank  4 Group  0 Pid 136269 on  nid001025 device  0 [0009:01:00] NVIDIA GH200 120GB
#  Rank  5 Group  0 Pid 136269 on  nid001025 device  1 [0019:01:00] NVIDIA GH200 120GB
#  Rank  6 Group  0 Pid 136269 on  nid001025 device  2 [0029:01:00] NVIDIA GH200 120GB
#  Rank  7 Group  0 Pid 136269 on  nid001025 device  3 [0039:01:00] NVIDIA GH200 120GB
NCCL version 2.19.3+cuda12.3
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong 
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)         
[...]
  8589934592    2147483648     float     sum      -1  92400.1   92.96  162.69       0  92381.3   92.98  162.72       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 62.3908 
#
# Collective test concluded: all_reduce_perf

Here you can see that NCCL is correctly using RDMA on Slingshot, achieving an all-reduce out-of-place bus bandwidth of 162.69 GB/s.
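
Putting this together, the benchmark can also be run from a batch script that loads the system module rather than setting each variable by hand. This is a minimal sketch: the Slurm options (account, partition, time limit) and the path to all_reduce_perf depend on your project and on where you built nccl-tests.

#!/bin/bash
#SBATCH --job-name=nccl-allreduce
#SBATCH --nodes=2
#SBATCH --gpus=8
#SBATCH --time=00:15:00

# Sets NCCL_NET, NCCL_SOCKET_IFNAME, and the FI_CXI_* variables described above
module load brics/nccl

srun --network=disable_rdzv_get ./nccl-tests/build/all_reduce_perf -b 32KB -e 8GB -f 2 -g 4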

Resources