Using torch.profiler, you can get a .json trace and examine it in chrome://tracing !!! (very cool) code: https://github.com/cuda-mode/lectures/blob/main/lecture_017/ddp_example.py
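
A minimal sketch of how such a trace can be produced (the model and input here are placeholders, not taken from the lecture's ddp_example.py, which profiles a DDP training step):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder workload just to have some CPU + CUDA activity to record.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(x).sum().backward()

# Open this file at chrome://tracing (or ui.perfetto.dev) to inspect the timeline.
prof.export_chrome_trace("trace.json")
```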

CUDA streams

  • Operations within a CUDA stream run sequentially, in launch order, e.g. first all-reduce the gradients, then run the optimizer step

  • Different CUDA streams have no ordering guarantees with respect to each other

  • NCCL launches a new CUDA stream for the all-reduces when averaging gradients

    • Launches an all-reduce (on a CUDA stream separate from the one running the backward ops) for a given layer as soon as that layer's backward pass is done (see the sketch below)
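
A minimal single-GPU sketch of this stream/event pattern (the arithmetic kernels are stand-ins: `grad * 2` for a layer's backward work and `grad += 1` for the all-reduce NCCL would launch; a real run would call torch.distributed.all_reduce inside an initialized process group):

```python
import torch

s_compute = torch.cuda.Stream()  # stands in for the backward/compute stream
s_comm = torch.cuda.Stream()     # stands in for NCCL's communication stream

grad = torch.randn(1 << 20, device="cuda")
done = torch.cuda.Event()

with torch.cuda.stream(s_compute):
    grad = grad * 2          # "backward" work for one layer, ordered within s_compute
    done.record(s_compute)   # mark the point where this layer's gradient is ready

with torch.cuda.stream(s_comm):
    s_comm.wait_event(done)  # the comm stream waits only for this layer, not all of backward
    grad += 1                # stand-in for the all-reduce kernel

torch.cuda.synchronize()
```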

Communicator Objects

  • A communicator object is needed to call an NCCL distributed primitive from a GPU

  • 1 GPU per CPU process

    • Root process generates uniqueId
    • Broadcast uniqueId to all processes (e.g. using MPI)
    • All processes initialize the communicator with the same id and a unique rank
    • Each process then launches the kernel (e.g. AllReduce); see the sketch after this list
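
A minimal per-process sketch of the same flow through torch.distributed's NCCL backend (assuming a `torchrun --nproc_per_node=<num_gpus>` launch; the uniqueId generation and broadcast happen inside init_process_group via the rendezvous store rather than through explicit MPI calls):

```python
import os
import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # 1 GPU per CPU process

    # Same id (rendezvous), unique rank -> one communicator per process.
    dist.init_process_group(backend="nccl")

    # Each process then launches the kernel (AllReduce) on its own GPU.
    x = torch.ones(4, device="cuda") * rank
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {x}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```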