• These primitives can either be rooted at a single node, or implemented with all-to-all communication to achieve the same result. We’ll assume a root node for simplicity.

In theory

  • Broadcast: send data from the root to all other nodes in the network

  • Reduce: aggregation of data from all nodes into root node

  • Gather: concatenation of data from all nodes into root node

  • Scatter: inverse of gather, distribute distinct pieces of data from root node to all nodes

  • Barrier: synchronization primitive that forces all nodes to wait until every node has reached the barrier point.

    • Example: the flush step of PipeDream-Flush, where gradients are all-reduced and every node updates its weights before proceeding.
  • All-{operation}: first do {operation} on the root node, then broadcast the result to all other nodes

  • {Operation}-scatter: first do {operation} on the root node, then scatter the result (see the sketch after this list).
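
A minimal sketch of these root-based primitives using torch.distributed (my choice of library here; the notes themselves are library-agnostic). It assumes a process group has already been initialized, e.g. with the gloo backend, and `demo_primitives` is a hypothetical helper run on every rank, with rank 0 acting as the root:

```python
import torch
import torch.distributed as dist

def demo_primitives(rank: int, world_size: int):
    # Assumes dist.init_process_group(...) has already been called on every rank.
    root = 0
    x = torch.full((4,), float(rank))

    # Broadcast: the root's tensor overwrites every other rank's copy of x.
    dist.broadcast(x, src=root)

    # Reduce: the element-wise sum of every rank's tensor lands on the root only.
    y = torch.full((4,), float(rank))
    dist.reduce(y, dst=root, op=dist.ReduceOp.SUM)

    # Gather: the root receives one tensor per rank (the concatenation, as a list).
    z = torch.full((2,), float(rank))
    gathered = [torch.empty(2) for _ in range(world_size)] if rank == root else None
    dist.gather(z, gather_list=gathered, dst=root)

    # Scatter: the root hands a distinct chunk to each rank.
    chunk = torch.empty(2)
    chunks = [torch.full((2,), float(i)) for i in range(world_size)] if rank == root else None
    dist.scatter(chunk, scatter_list=chunks, src=root)

    # Barrier: no rank proceeds past this line until every rank has reached it.
    dist.barrier()
```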

In practice

  • AllReduce can be implemented in multiple ways
  • NCCL picks the best one for you, depending on network topology
    • Ring AllReduce = (ReduceScatter + AllGather); see the sketch after this list
      • ReduceScatter (reduce in a ring)
      • AllGather (propagate result in a ring)
    • Bucketed Ring All-Reduce allows for communication-computation overlap
      • e.g. as soon as the last layer’s gradients are computed in the backward pass, we can start sending them along the ring while the rest of the backward pass is computed.
      • This communication is launched in a separate CUDA stream, as NCCL does.
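
A toy single-process simulation of ring all-reduce, just to make the two phases concrete. The function name and the list-of-lists representation are illustrative; a real implementation such as NCCL pipelines chunks over the actual interconnect rather than looping in Python:

```python
import copy

def ring_all_reduce(buffers):
    """Toy simulation: buffers[r] is node r's vector, split into n equal chunks
    (n = number of nodes). Afterwards every buffer holds the element-wise sum."""
    n = len(buffers)
    chunk = len(buffers[0]) // n  # assumes the vector length divides evenly

    def span(c):  # index range covered by chunk c
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter (reduce in a ring). After n-1 steps, node r owns
    # the fully summed chunk (r + 1) % n.
    for step in range(n - 1):
        snapshot = copy.deepcopy(buffers)        # emulate simultaneous sends
        for r in range(n):
            c = (r - step) % n                   # chunk node r forwards this step
            for i in span(c):
                buffers[(r + 1) % n][i] += snapshot[r][i]

    # Phase 2: all-gather (propagate the result in a ring). Each node forwards the
    # reduced chunk it owns until every node has every chunk.
    for step in range(n - 1):
        snapshot = copy.deepcopy(buffers)
        for r in range(n):
            c = (r + 1 - step) % n               # reduced chunk node r forwards
            for i in span(c):
                buffers[(r + 1) % n][i] = snapshot[r][i]
    return buffers
```

For example, `ring_all_reduce([[1.0, 2.0], [10.0, 20.0]])` leaves both buffers equal to `[11.0, 22.0]`.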

1. All-Reduce

  • function: aggregating gradients or matrices across all nodes.
  • It performs two main functions:
    • a reduction operation (like sum, max, or average)
    • followed by a broadcast of the result to all participating nodes.
  • The goal is to ensure that, after the operation, every node ends up with the same aggregate value.

Example: If four nodes are training a model and each node calculates its own gradient, all-reduce will sum up these gradients across all nodes and then distribute the result back to each node. This way, each node has the same, updated gradient for the next step in training.
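A sketch of this gradient-averaging pattern with torch.distributed (assuming an initialized process group; `sync_gradients` is a hypothetical helper called on every rank after its backward pass):

```python
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module):
    # Assumes dist.init_process_group(...) was called and every rank has
    # finished its backward pass for the current step.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is None:
            continue
        # Sum this gradient across all ranks; every rank ends up with the same tensor.
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size  # average, so every rank takes an identical optimizer step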

2. All-Gather

  • function: collects specific data from each node and gathers it into a larger data structure that is then shared with all nodes.

Example: If each of four nodes holds a piece of a dataset or a segment of a vector, all-gather will combine these pieces into a complete dataset or vector and replicate it across each node, so every node ends up with the full dataset or vector.
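A minimal sketch with torch.distributed.all_gather (assuming equal shard sizes and an initialized process group; `gather_shards` is an illustrative helper, not a library function):

```python
import torch
import torch.distributed as dist

def gather_shards(local_shard: torch.Tensor) -> torch.Tensor:
    # Each rank holds one equally sized shard of a vector; afterwards every rank
    # holds the full vector, with shards[i] coming from rank i.
    world_size = dist.get_world_size()
    shards = [torch.empty_like(local_shard) for _ in range(world_size)]
    dist.all_gather(shards, local_shard)
    return torch.cat(shards)
```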

3. Reduce-Scatter

  • function: combination of reduction and scattering operations.
  • First, it performs a reduction on data from all nodes (e.g. an element-wise sum),
  • but instead of broadcasting the full result to every node, it scatters it, distributing one segment of the reduced data to each node.

Example: If four nodes each have a vector of gradients, reduce-scatter might sum up these vectors and then split the resulting vector into four parts, distributing each part to one of the four nodes. Thus, each node receives only a segment of the reduced data, not the entire sum.
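A minimal sketch with torch.distributed.reduce_scatter (again assuming an initialized process group and a gradient length divisible by the world size; `reduce_scatter_gradients` is an illustrative helper):

```python
import torch
import torch.distributed as dist

def reduce_scatter_gradients(full_grad: torch.Tensor) -> torch.Tensor:
    # Each rank starts with a full gradient vector; afterwards rank r holds only
    # chunk r of the element-wise sum across all ranks.
    world_size = dist.get_world_size()
    chunks = list(full_grad.chunk(world_size))  # assumes the length divides evenly
    out = torch.empty_like(chunks[0])
    dist.reduce_scatter(out, chunks, op=dist.ReduceOp.SUM)
    return out
```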