  • AIP1 Isambard-AI Phase 1 supported
  • AIP2 Isambard-AI Phase 2 supported
  • I3 Isambard 3 unsupported
  • BC5 BlueCrystal 5 unsupported

NCCL

The NVIDIA Collective Communication Library (NCCL) provides direct GPU-to-GPU communication for NVIDIA GPUs. This guide shows how NCCL can be used on Isambard-AI's HPE Slingshot 11 high-speed network.

NCCL uses GPUDirect RDMA (Remote Direct Memory Access) for direct transfers between GPU memory and the network. To use this on the Slingshot network, you must use the aws-ofi-nccl network plugin.

NCCL and aws-ofi-nccl are provided on the system as follows:

$ module load brics/nccl

The rest of this guide explains NCCL optimisations on Slingshot, many of which are application dependent.
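As a sketch, a minimal Slurm job script using this module might look like the following; my_nccl_app is a placeholder for your own application, and the srun flags mirror those used later in this guide.

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus=8
#SBATCH --time=00:10:00

# Load NCCL and the aws-ofi-nccl plugin; this also sets the
# environment variables described in the next section.
module load brics/nccl

# my_nccl_app is a placeholder for your NCCL-based application
srun -N 2 --gpus 8 --network=disable_rdzv_get ./my_nccl_app
```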

Environment Variables

NCCL environment variables are used to optimise performance on the Slingshot network. The official NCCL documentation has in-depth explanations of these variables.

These variables are set in the module brics/nccl, but are explained here for application-specific optimisation or if you are building your own version.

# Force NCCL to use the aws-ofi-nccl libfabric plugin
export NCCL_NET="AWS Libfabric"
# Use the high speed network interface
export NCCL_SOCKET_IFNAME="hsn"
# Print the NCCL version at startup
export NCCL_DEBUG="VERSION"
# Use P2P when GPUs share the same NUMA node
export NCCL_NET_GDR_LEVEL="PHB"
# Allow rings/trees to span multiple NICs
export NCCL_CROSS_NIC="1"
export NCCL_MIN_NCHANNELS="4"
export NCCL_GDRCOPY_ENABLE="1"
export NCCL_NET_FORCE_FLUSH="0"

# Libfabric (FI) tuning for Slingshot
export FI_CXI_DEFAULT_CQ_SIZE="131072"
export FI_CXI_DEFAULT_TX_SIZE="2048"
export FI_CXI_DISABLE_NON_INJECT_MSG_IDC="1"
export FI_HMEM_CUDA_USE_GDRCOPY="1"
export FI_CXI_DISABLE_HOST_REGISTER="1"
export FI_MR_CACHE_MONITOR="userfaultfd"
export FI_CXI_RDZV_PROTO="alt_read"
export FI_CXI_RDZV_THRESHOLD="0"
export FI_CXI_RDZV_GET_MIN="0"
export FI_CXI_RDZV_EAGER_SIZE="0"
export FI_CXI_RX_MATCH_MODE="hybrid"

Click here to download the file: env_vars.sh
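As a pre-flight sanity check, a small Bash helper (hypothetical, not part of the module) can confirm that the key variables from env_vars.sh are set before launching a job:

```shell
# Hypothetical helper: check that key NCCL/libfabric variables are set
# (uses Bash indirect expansion, so run under bash).
check_nccl_env() {
    local var missing=0
    for var in NCCL_NET NCCL_SOCKET_IFNAME NCCL_NET_GDR_LEVEL FI_CXI_RDZV_PROTO; do
        if [ -z "${!var}" ]; then
            echo "unset: $var" >&2
            missing=1
        fi
    done
    return $missing
}

# Example: after setting the variables (as env_vars.sh does), the check passes
export NCCL_NET="AWS Libfabric" NCCL_SOCKET_IFNAME="hsn" \
       NCCL_NET_GDR_LEVEL="PHB" FI_CXI_RDZV_PROTO="alt_read"
check_nccl_env && echo "NCCL environment looks complete"
```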

Some optimisations are application dependent

The default system settings optimise for a wide range of scenarios. However, some optimisations are application dependent, notably NCCL_MIN_NCHANNELS, FI_CXI_DEFAULT_CQ_SIZE, and FI_CXI_DEFAULT_TX_SIZE.

For more information, see the official NCCL documentation on environment variables.
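For example, the application-dependent defaults set by the module can be overridden after loading it; the values below are illustrative, not tuned recommendations:

```shell
# Override application-dependent defaults after `module load brics/nccl`
# (illustrative values only - tune for your own workload)
export NCCL_MIN_NCHANNELS=16
export FI_CXI_DEFAULT_CQ_SIZE=262144
export FI_CXI_DEFAULT_TX_SIZE=4096
echo "NCCL_MIN_NCHANNELS=$NCCL_MIN_NCHANNELS"
```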

NCCL and AWS Libfabric

The environment variable NCCL_NET="AWS Libfabric" forces NCCL to use the aws-ofi-nccl plugin. If the plugin isn't set up in your environment, you will see the following error:

NCCL WARN Error: network AWS Libfabric not found.

Debugging

When debugging NCCL, set NCCL_DEBUG=INFO to enable the NCCL debug logs.
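With NCCL_DEBUG=INFO set, NCCL logs which network backend it selected at startup. The sketch below writes an illustrative log line (the exact wording varies by NCCL version) and shows the grep you can run on your job output to confirm the plugin is active:

```shell
# Illustrative NCCL INFO log line, of the kind written to job output
# when NCCL_DEBUG=INFO is set (format varies by NCCL version)
printf '%s\n' 'nid001024:1234:1234 [0] NCCL INFO Using network AWS Libfabric' > job.out

# Grep the job output to confirm the aws-ofi-nccl backend was selected
if grep -q 'Using network AWS Libfabric' job.out; then
    echo 'aws-ofi-nccl plugin selected'
fi
```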

Build NCCL

Installing NCCL

If you need a specific version of NCCL, or you are building a custom container, you can build NCCL with the following instructions:

$ module load cudatoolkit PrgEnv-gnu
$ git clone --branch "v2.27.7-1" https://github.com/NVIDIA/nccl.git
$ cd nccl
$ mkdir build
$ make -j 8 src.build PREFIX=$(realpath build)

Installing aws-ofi-nccl

The NCCL network plugin aws-ofi-nccl must be built for NCCL to use the Slingshot high speed network. It depends on CUDA and Libfabric. Ensure you use a version that is tagged with aws. The network plugin can be installed as follows:

$ module load cudatoolkit PrgEnv-gnu
$ git clone --branch "v1.7.x-aws" https://github.com/aws/aws-ofi-nccl.git
$ cd aws-ofi-nccl
$ ./autogen.sh 
$ export LIBFABRIC_HOME=/opt/cray/libfabric/1.22.0 CC=/usr/bin/gcc-12 CXX=/usr/bin/g++-12
$ mkdir build
$ ./configure --prefix=$(realpath build) --with-cuda=${CUDA_HOME} --with-libfabric=${LIBFABRIC_HOME} --disable-tests
$ make -j 8 install

Benchmark NCCL with nccl-tests

nccl-tests depends on MPI, NCCL, and CUDA. We can build it with the following instructions:

$ module load cudatoolkit PrgEnv-gnu
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ export MPI_HOME=/opt/cray/pe/mpich/8.1.32/ofi/gnu/12.3/ 
# Make sure you set NCCL_HOME to your build directory
$ export NCCL_HOME=$(realpath ../nccl/build)
$ make -j 8 MPI=1 MPI_HOME=${MPI_HOME} NCCL_HOME=${NCCL_HOME} CUDA_HOME=${CUDA_HOME}

We can now run an All Reduce benchmark to measure the network bandwidth for collectives. Running without the plugin, we see a maximum bus bandwidth (busbw) of only 2.32 GB/s.

$ srun -N 2 --gpus 8 --network=disable_rdzv_get $(realpath nccl-tests/build/all_reduce_perf) -b 32KB -e 8GB -f 2 -g 4
# nccl-tests version 2.17.6 nccl-headers=21903 nccl-library=21903
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 4 minBytes 8 maxBytes 4294967296 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 123253 on  nid001024 device  0 [0009:01:00] NVIDIA GH200 120GB
[...]
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong 
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)         
[...]
  8589934592    2147483648     float     sum      -1  6492425    1.32    2.32       0  6488797    1.32    2.32       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.3806
#
# Collective test concluded: all_reduce_perf

The above shows very slow performance, indicating that NCCL is not using RDMA and has instead fallen back to a TCP socket interface. To enable RDMA, we source the environment variables above and add the plugin to our environment:

$ export LD_LIBRARY_PATH=$(realpath aws-ofi-nccl/build/lib):$LD_LIBRARY_PATH
# Ensure you source the environment variables from above
$ srun -N 2 --gpus 8 --cpus-per-task 72 --network=disable_rdzv_get nccl-tests/build/all_reduce_perf -b 32KB -e 8GB -f 2 -g 4
# nccl-tests version 2.17.6 nccl-headers=21903 nccl-library=21903
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 4 minBytes 32768 maxBytes 8589934592 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 125488 on  nid001024 device  0 [0009:01:00] NVIDIA GH200 120GB
#  Rank  1 Group  0 Pid 125488 on  nid001024 device  1 [0019:01:00] NVIDIA GH200 120GB
#  Rank  2 Group  0 Pid 125488 on  nid001024 device  2 [0029:01:00] NVIDIA GH200 120GB
#  Rank  3 Group  0 Pid 125488 on  nid001024 device  3 [0039:01:00] NVIDIA GH200 120GB
#  Rank  4 Group  0 Pid 136269 on  nid001025 device  0 [0009:01:00] NVIDIA GH200 120GB
#  Rank  5 Group  0 Pid 136269 on  nid001025 device  1 [0019:01:00] NVIDIA GH200 120GB
#  Rank  6 Group  0 Pid 136269 on  nid001025 device  2 [0029:01:00] NVIDIA GH200 120GB
#  Rank  7 Group  0 Pid 136269 on  nid001025 device  3 [0039:01:00] NVIDIA GH200 120GB
NCCL version 2.19.3+cuda12.3
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong 
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)         
[...]
  8589934592    2147483648     float     sum      -1  92400.1   92.96  162.69       0  92381.3   92.98  162.72       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 62.3908 
#
# Collective test concluded: all_reduce_perf

Here you can see that NCCL is correctly using RDMA on Slingshot, achieving an All Reduce out-of-place bus bandwidth of 162.69 GB/s.

GPU to NIC topology information

Setting the environment variable NCCL_TOPO_DUMP_FILE=topo.txt makes NCCL save its topology information to the named file. The commands cxi_stat and nvidia-smi -L list the Network Interface Cards (NICs) and GPUs respectively; together these can confirm which Slingshot network device is used by each GPU.
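The checks above can be combined into a short sketch; the listing commands are guarded so the snippet is safe to run somewhere they are unavailable:

```shell
# Ask NCCL to dump its detected topology on the next run
export NCCL_TOPO_DUMP_FILE=topo.txt

# List Slingshot NICs and GPUs for comparison against topo.txt
# (guarded: skipped silently if the tools are not on PATH)
command -v cxi_stat  >/dev/null 2>&1 && cxi_stat
command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L

echo "NCCL will write topology to $NCCL_TOPO_DUMP_FILE"
```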

Build NCCL in a container

For container workloads, we recommend using Apptainer/Singularity. The brics/apptainer-multi-node module injects the host MPI, NCCL, and aws-ofi-nccl libraries into your container, enabling multi-node workloads. The NCCL and aws-ofi-nccl host libraries are built against the CUDA runtime provided by the host system. More information on this module can be found in our Apptainer/Singularity Multi-node guide.

If your container provides, or requires, a newer CUDA runtime than the host version, such as one provided in an NVIDIA GPU Cloud (NGC) image, you may need a compatible version of NCCL and the aws-ofi-nccl OFI plugin. Such container images may include a build of NCCL that is compatible with the CUDA runtime supplied; in that case we only need to build a compatible aws-ofi-nccl plugin.

To build compatible NCCL and aws-ofi-nccl libraries, we first build a base image containing our application, then run the container to build NCCL and aws-ofi-nccl against the CUDA runtime inside it.

Note

These libraries should be built into a folder bind-mounted from the host, such as one within the home folder.

The folder containing these compiled libraries will need to be bind-mounted to the container each time it is run.

As an example, we will configure a container based on the PyTorch 26.01 NGC image to use RDMA over Slingshot with NCCL. This image provides a version of NCCL built against the CUDA runtime provided within the image, but we will build new, compatible versions of the NCCL and aws-ofi-nccl libraries. We will also build the nccl-tests benchmark suite to test that our container has been properly configured for RDMA.

Bootstrap: docker
From: nvcr.io/nvidia/pytorch:26.01-py3

%setup
    # Copy NCCL environment variables script, downloaded above, into container
    cp env_vars.sh ${SINGULARITY_ROOTFS}/opt
%post
    apt-get update && apt-get install -y --no-install-recommends \
        build-essential git autoconf automake libtool
    apt-get clean && rm -rf /var/lib/apt/lists/*

    # This is specific to this image
    #   - Overwrite ld cache for libfabric and the OFI plugin
    #   - Ensures that the host libfabric and our compatible aws-ofi-nccl plugin are used
    #   - We will build aws-ofi-nccl into the /opt/slingshot/aws-ofi-nccl folder
    sed -i 's|/opt/amazon/efa/lib|/host/opt/cray/libfabric/1.22.0/lib64|g' /etc/ld.so.conf.d/efa.conf
    sed -i 's|/opt/amazon/aws-ofi-nccl/lib|/opt/slingshot/aws-ofi-nccl/lib|g' /etc/ld.so.conf.d/aws-ofi-nccl.conf
    ldconfig
%environment
    # Set NCCL environment variables
    . /opt/env_vars.sh
    export CUDA_HOME=/usr/local/cuda
    export NCCL_HOME=/opt/slingshot/nccl
    export LIBFABRIC_HOME=/host/opt/cray/libfabric/1.22.0
    export MPI_HOME=/usr/local/mpi
    export TMPDIR=/tmp
    export LD_LIBRARY_PATH=$LIBFABRIC_HOME/lib64:$NCCL_HOME/lib:/opt/slingshot/aws-ofi-nccl/lib:$LD_LIBRARY_PATH:/host/usr/lib64
%runscript
    exec "$@"

Click here to download the complete file: pytorch_multinode.def

Before building, also download env_vars.sh and place it in the same directory as pytorch_multinode.def. The build process requires this file to be present alongside the .def file.

We now build our .sif image:

$ mkdir $HOME/sif-images
$ srun --gpus=1 --time=00:30:00 singularity build --fakeroot $HOME/sif-images/pytorch.sif pytorch_multinode.def

Once complete, we can create a Slurm job script to build NCCL and aws-ofi-nccl inside this container:

#!/bin/bash
#SBATCH --job-name=build-nccl
#SBATCH --nodes=1
#SBATCH --gpus=1
#SBATCH --ntasks=1
#SBATCH --time=00:30:00
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err

# Directory on host to store built libraries and binaries
mkdir -p $HOME/nccl_build

singularity exec --nv \
    --bind /opt/cray/libfabric/1.22.0:/host/opt/cray/libfabric/1.22.0:ro \
    --bind /usr/lib64:/host/usr/lib64:ro \
    --bind $HOME/nccl_build:/opt/slingshot \
    $HOME/sif-images/pytorch.sif bash -c '
        export CUDA_HOME=/usr/local/cuda
        export NCCL_HOME=/lib/aarch64-linux-gnu
        export LIBFABRIC_HOME=/host/opt/cray/libfabric/1.22.0
        export MPI_HOME=/usr/local/mpi
        export TMPDIR=/tmp

        # Build nccl library
        cd /tmp && LD_LIBRARY_PATH=/lib/aarch64-linux-gnu git clone --branch "v2.29.2-1" https://github.com/NVIDIA/nccl.git
        cd /tmp/nccl && mkdir /opt/slingshot/nccl
        make -j $(nproc) install src.build BUILDDIR=/opt/slingshot/nccl NVCC_GENCODE="-gencode=arch=compute_90,code=sm_90"
        export NCCL_HOME=/opt/slingshot/nccl

        # Build hwloc library, required to build aws-ofi-nccl
        cd /tmp && LD_LIBRARY_PATH=/lib/aarch64-linux-gnu git clone --branch "v2.13" https://github.com/open-mpi/hwloc.git
        cd /tmp/hwloc && ./autogen.sh
        ./configure --disable-nvml --prefix=/opt/slingshot/hwloc
        make -j $(nproc) install

        # Build aws-ofi-nccl
        cd /tmp && LD_LIBRARY_PATH=/lib/aarch64-linux-gnu git clone --branch "v1.18.0" https://github.com/aws/aws-ofi-nccl.git
        cd /tmp/aws-ofi-nccl && ./autogen.sh
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/host/usr/lib64
        ./configure --prefix=/opt/slingshot/aws-ofi-nccl \
            --with-cuda=${CUDA_HOME} \
            --with-libfabric=${LIBFABRIC_HOME} \
            --with-mpi=${MPI_HOME} \
            --with-hwloc=/opt/slingshot/hwloc \
            --disable-tests
        make -j $(nproc) install

        # Build nccl-tests, if required - this is optional but useful for testing purposes
        cd /tmp && LD_LIBRARY_PATH=/lib/aarch64-linux-gnu git clone https://github.com/NVIDIA/nccl-tests.git
        cd /tmp/nccl-tests && make -j $(nproc) MPI=1
        cp -r /tmp/nccl-tests/build /opt/slingshot/nccl-tests
    '

Click here to download the complete file: build_nccl.sh

Now we can submit the Slurm job:

$ sbatch build_nccl.sh

When running the container, you must bind-mount the build folder ($HOME/nccl_build in this example) to the same path inside the container that was used during the build (/opt/slingshot in the example), as well as the libfabric directory and /usr/lib64 (which contains libcxi, needed to communicate with the Slingshot Cassini NICs).

$ singularity run --nv \
    --bind /opt/cray/libfabric/1.22.0:/host/opt/cray/libfabric/1.22.0:ro \
    --bind /usr/lib64:/host/usr/lib64:ro \
    --bind $HOME/nccl_build:/opt/slingshot \
    pytorch.sif bash

Benchmark NCCL in a container with nccl-tests

If you used NCCL and aws-ofi-nccl provided by the host system, you can follow the steps from our Apptainer/Singularity Multi-node guide to run nccl-tests.

If you followed the Build NCCL in a container step, you can run the NCCL All Reduce benchmark between two compute nodes:

#!/bin/bash
#SBATCH --job-name=bench-nccl
#SBATCH --nodes=2
#SBATCH --gpus=8
#SBATCH --time=00:10:00
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err

srun -N 2 \
    --gpus 8 \
    --cpus-per-task 72 \
    --tasks-per-node 1 \
    --network=disable_rdzv_get \
    --mpi=pmi2 \
    singularity exec --nv \
    --bind /opt/cray/libfabric/1.22.0:/host/opt/cray/libfabric/1.22.0:ro \
    --bind /usr/lib64:/host/usr/lib64:ro \
    --bind $HOME/nccl_build:/opt/slingshot \
    $HOME/sif-images/pytorch.sif /opt/slingshot/nccl-tests/all_reduce_perf -b 32KB -e 8GB -f 2 -g 4

Click here to download the complete file: bench_nccl.sh

Resources