- NVIDIA Collective Communications Library
- Provides ways for GPUs to communicate data quickly
- https://www.youtube.com/watch?v=T22e3fgit-A
- https://github.com/cuda-mode/lectures/tree/main/lecture_017
Using torch.profiler, you can export a .json trace and examine it in chrome://tracing !!! (very cool). Code: https://github.com/cuda-mode/lectures/blob/main/lecture_017/ddp_example.py
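A minimal sketch of producing such a trace, using an arbitrary placeholder model rather than the lecture's DDP setup:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model/input; the lecture's ddp_example.py profiles a DDP training step instead.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    y = model(x)
    y.sum().backward()

# Load the resulting .json in chrome://tracing (or Perfetto) to inspect CPU ops and CUDA kernels.
prof.export_chrome_trace("trace.json")
```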
CUDA streams
- Operations within a CUDA stream must run sequentially, i.e. first do the all-reduce on the gradients, then do the optimizer step
- Different CUDA streams don't have any ordering obligations with respect to each other
- NCCL launches a new CUDA stream for the all-reduces when averaging gradients
    - Launches an all-reduce (on a CUDA stream separate from the backward operations) for a given layer as soon as that layer finishes its backward pass (see the sketch after this list)
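A minimal sketch of these stream semantics using torch.cuda.Stream; the in-place scale below is only a stand-in for the all-reduce that NCCL would run on its own stream:

```python
import torch

assert torch.cuda.is_available()

grad = torch.randn(1024, 1024, device="cuda")
comm_stream = torch.cuda.Stream()  # separate stream: no implicit ordering vs. the default stream

# "Backward" work on the default stream (ops within one stream run sequentially).
loss_like = (grad * grad).sum()

with torch.cuda.stream(comm_stream):
    # Make the comm stream wait for the default stream's work so far,
    # then run the "all-reduce" (here just an in-place scale as a placeholder).
    comm_stream.wait_stream(torch.cuda.default_stream())
    grad.div_(2)

# The optimizer step (back on the default stream) must wait for the comm stream to finish.
torch.cuda.default_stream().wait_stream(comm_stream)
```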
Communicator Objects
- A communicator object is needed to call an NCCL distributed primitive from a GPU
- 1 GPU per CPU process
- Root process generates a uniqueId
- Broadcast the uniqueId to all processes (e.g. using MPI)
- All processes initialize a communicator with the same id and a unique rank
- Each process then launches the kernel (e.g. AllReduce) (see the sketch after this list)
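A minimal sketch of this flow through torch.distributed's NCCL backend, which performs the uniqueId generation and broadcast internally via its rendezvous store; launching with torchrun (which sets RANK, WORLD_SIZE, LOCAL_RANK) is assumed:

```python
import os
import torch
import torch.distributed as dist

# Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
local_rank = int(os.environ["LOCAL_RANK"])

torch.cuda.set_device(local_rank)

# NCCL backend, roughly: rank 0 creates the unique id, shares it with the other
# ranks through the rendezvous store, and every rank builds a communicator
# with the same id and its own unique rank.
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

# Each process launches the collective kernel, e.g. an all-reduce averaging a "gradient".
grad = torch.ones(4, device="cuda") * rank
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
grad /= world_size

dist.destroy_process_group()
```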