- NVIDIA Collective Communications Library
- Provides ways for GPUs to communicate data quickly
- https://www.youtube.com/watch?v=T22e3fgit-A
- https://github.com/cuda-mode/lectures/tree/main/lecture_017
Using torch.profiler, you can export a .json trace and examine it in chrome://tracing !!! (very cool). Code: https://github.com/cuda-mode/lectures/blob/main/lecture_017/ddp_example.py
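A minimal sketch of producing such a trace, using an arbitrary placeholder model rather than the lecture's DDP setup:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model/input; the lecture's ddp_example.py profiles a DDP training step instead.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    y = model(x)
    y.sum().backward()

# Load the resulting .json in chrome://tracing (or Perfetto) to inspect CPU ops and CUDA kernels.
prof.export_chrome_trace("trace.json")
```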
CUDA streams
- Operations within a CUDA stream must run sequentially, i.e. first do the all-reduce on the gradients, then do the optimizer step
- Different CUDA streams don't have any ordering obligations with respect to each other
- NCCL launches a new CUDA stream for the all-reduces when averaging gradients
    - Launches an all-reduce (on a CUDA stream separate from the backward operations) for a given layer as soon as that layer finishes its backward pass (see the sketch after this list)
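A minimal sketch of these stream semantics using torch.cuda.Stream; the in-place scale below is only a stand-in for the all-reduce that NCCL would run on its own stream:

```python
import torch

assert torch.cuda.is_available()

grad = torch.randn(1024, 1024, device="cuda")
comm_stream = torch.cuda.Stream()  # separate stream: no implicit ordering vs. the default stream

# "Backward" work on the default stream (ops within one stream run sequentially).
loss_like = (grad * grad).sum()

with torch.cuda.stream(comm_stream):
    # Make the comm stream wait for the default stream's work so far,
    # then run the "all-reduce" (here just an in-place scale as a placeholder).
    comm_stream.wait_stream(torch.cuda.default_stream())
    grad.div_(2)

# The optimizer step (back on the default stream) must wait for the comm stream to finish.
torch.cuda.default_stream().wait_stream(comm_stream)
```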
Communicator Objects
- A communicator object is needed to call an NCCL distributed primitive from a GPU
- 1 GPU per CPU process
- Root process generates a uniqueId
- Broadcast the uniqueId to all processes (e.g. using MPI)
- All processes initialize a communicator with the same id and a unique rank
- Each process then launches the kernel (e.g. AllReduce) (see the sketch after this list)
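A minimal sketch of this flow through torch.distributed's NCCL backend, which performs the uniqueId generation and broadcast internally via its rendezvous store; launching with torchrun (which sets RANK, WORLD_SIZE, LOCAL_RANK) is assumed:

```python
import os
import torch
import torch.distributed as dist

# Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
local_rank = int(os.environ["LOCAL_RANK"])

torch.cuda.set_device(local_rank)

# NCCL backend, roughly: rank 0 creates the unique id, shares it with the other
# ranks through the rendezvous store, and every rank builds a communicator
# with the same id and its own unique rank.
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

# Each process launches the collective kernel, e.g. an all-reduce averaging a "gradient".
grad = torch.ones(4, device="cuda") * rank
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
grad /= world_size

dist.destroy_process_group()
```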