• These primitives can either be rooted at a single node, or implemented with all-to-all communication to achieve the same result. We’ll assume a root node for simplicity.

In theory

  • Broadcast: send data from the root to all other nodes in the network

  • Reduce: aggregation of data from all nodes into root node

  • Gather: concatenation of data from all nodes into root node

  • Scatter: inverse of gather, distribute distinct pieces of data from root node to all nodes

  • Barrier: synchronization primitive that forces all nodes to wait until every node has reached the barrier point.

    • Example: the flush step of PipeDream-Flush, where gradients are all-reduced and every node updates its weights before proceeding.
  • All-{operation}: first do {operation} on the root node, then broadcast the result to all other nodes

  • {Operation}-scatter: first do {operation} on the root node, then scatter the result (see the sketch after this list).
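
A minimal sketch of these root-based primitives using torch.distributed (my choice of library here; the notes themselves are library-agnostic). It assumes a process group has already been initialized, e.g. with the gloo backend, and `demo_primitives` is a hypothetical helper run on every rank, with rank 0 acting as the root:

```python
import torch
import torch.distributed as dist

def demo_primitives(rank: int, world_size: int):
    # Assumes dist.init_process_group(...) has already been called on every rank.
    root = 0
    x = torch.full((4,), float(rank))

    # Broadcast: the root's tensor overwrites every other rank's copy of x.
    dist.broadcast(x, src=root)

    # Reduce: the element-wise sum of every rank's tensor lands on the root only.
    y = torch.full((4,), float(rank))
    dist.reduce(y, dst=root, op=dist.ReduceOp.SUM)

    # Gather: the root receives one tensor per rank (the concatenation, as a list).
    z = torch.full((2,), float(rank))
    gathered = [torch.empty(2) for _ in range(world_size)] if rank == root else None
    dist.gather(z, gather_list=gathered, dst=root)

    # Scatter: the root hands a distinct chunk to each rank.
    chunk = torch.empty(2)
    chunks = [torch.full((2,), float(i)) for i in range(world_size)] if rank == root else None
    dist.scatter(chunk, scatter_list=chunks, src=root)

    # Barrier: no rank proceeds past this line until every rank has reached it.
    dist.barrier()
```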

In practice

  • AllReduce can be implemented in multiple ways
  • NCCL picks the best one for you, depending on network topology
    • Ring AllReduce = (ReduceScatter + AllGather); see the sketch after this list
      • ReduceScatter (reduce in a ring)
      • AllGather (propagate result in a ring)
    • Bucketed Ring All-Reduce allows for communication-computation overlap
      • e.g. as soon as the last layer’s gradients are computed in the backward pass, we can start sending them along the ring while the rest of the backward pass is computed.
      • This communication is launched in a separate CUDA stream, as NCCL does.
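
A toy single-process simulation of ring all-reduce, just to make the two phases concrete. The function name and the list-of-lists representation are illustrative; a real implementation such as NCCL pipelines chunks over the actual interconnect rather than looping in Python:

```python
import copy

def ring_all_reduce(buffers):
    """Toy simulation: buffers[r] is node r's vector, split into n equal chunks
    (n = number of nodes). Afterwards every buffer holds the element-wise sum."""
    n = len(buffers)
    chunk = len(buffers[0]) // n  # assumes the vector length divides evenly

    def span(c):  # index range covered by chunk c
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter (reduce in a ring). After n-1 steps, node r owns
    # the fully summed chunk (r + 1) % n.
    for step in range(n - 1):
        snapshot = copy.deepcopy(buffers)        # emulate simultaneous sends
        for r in range(n):
            c = (r - step) % n                   # chunk node r forwards this step
            for i in span(c):
                buffers[(r + 1) % n][i] += snapshot[r][i]

    # Phase 2: all-gather (propagate the result in a ring). Each node forwards the
    # reduced chunk it owns until every node has every chunk.
    for step in range(n - 1):
        snapshot = copy.deepcopy(buffers)
        for r in range(n):
            c = (r + 1 - step) % n               # reduced chunk node r forwards
            for i in span(c):
                buffers[(r + 1) % n][i] = snapshot[r][i]
    return buffers
```

For example, `ring_all_reduce([[1.0, 2.0], [10.0, 20.0]])` leaves both buffers equal to `[11.0, 22.0]`.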

1. All-Reduce

  • function: aggregating gradients or matrices across all nodes.
  • It performs two main functions:
    • a reduction operation (like sum, max, or average)
    • followed by a broadcast of the result to all participating nodes.
  • The goal is to ensure that, after the operation, every node ends up with the same aggregate value.

Example: If four nodes are training a model and each node calculates its own gradient, all-reduce will sum up these gradients across all nodes and then distribute the result back to each node. This way, each node has the same, updated gradient for the next step in training.
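A sketch of this gradient-averaging pattern with torch.distributed (assuming an initialized process group; `sync_gradients` is a hypothetical helper called on every rank after its backward pass):

```python
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module):
    # Assumes dist.init_process_group(...) was called and every rank has
    # finished its backward pass for the current step.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is None:
            continue
        # Sum this gradient across all ranks; every rank ends up with the same tensor.
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size  # average, so every rank takes an identical optimizer step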

2. All-Gather

  • function: collects specific data from each node and gathers it into a larger data structure that is then shared with all nodes.

Example: If each of four nodes holds a piece of a dataset or a segment of a vector, all-gather will combine these pieces into a complete dataset or vector and replicate it across each node, so every node ends up with the full dataset or vector.
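A minimal sketch with torch.distributed.all_gather (assuming equal shard sizes and an initialized process group; `gather_shards` is an illustrative helper, not a library function):

```python
import torch
import torch.distributed as dist

def gather_shards(local_shard: torch.Tensor) -> torch.Tensor:
    # Each rank holds one equally sized shard of a vector; afterwards every rank
    # holds the full vector, with shards[i] coming from rank i.
    world_size = dist.get_world_size()
    shards = [torch.empty_like(local_shard) for _ in range(world_size)]
    dist.all_gather(shards, local_shard)
    return torch.cat(shards)
```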

3. Reduce-Scatter

  • function: combination of reduction and scattering operations.
  • First, it performs a reduction on data from all nodes (e.g. an element-wise sum),
  • but instead of broadcasting the full result to every node, it scatters it, distributing one segment of the reduced data to each node.

Example: If four nodes each have a vector of gradients, reduce-scatter might sum up these vectors and then split the resulting vector into four parts, distributing each part to one of the four nodes. Thus, each node receives only a segment of the reduced data, not the entire sum.
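A minimal sketch with torch.distributed.reduce_scatter (again assuming an initialized process group and a gradient length divisible by the world size; `reduce_scatter_gradients` is an illustrative helper):

```python
import torch
import torch.distributed as dist

def reduce_scatter_gradients(full_grad: torch.Tensor) -> torch.Tensor:
    # Each rank starts with a full gradient vector; afterwards rank r holds only
    # chunk r of the element-wise sum across all ranks.
    world_size = dist.get_world_size()
    chunks = list(full_grad.chunk(world_size))  # assumes the length divides evenly
    out = torch.empty_like(chunks[0])
    dist.reduce_scatter(out, chunks, op=dist.ReduceOp.SUM)
    return out
```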