What are the pros and cons of 3D parallelism (DP + TP + PP) vs. ZeRO-3 parallelism?

Memory

Activation memory

  • Activation memory of a transformer layer (for 3D parallelism)
    • very little overhead (≈5x reduction in activation memory, ≈4% overhead in forward+backward ⇒ ≈30% increase in throughput via increased batch size); a rough per-layer estimate is sketched after this list
  • ZeRO-3
    • naive ZeRO-3 doesn’t allow for such an optimization: you can still apply selective recomputation, but each GPU runs the full model on its own micro-batch, so it still has to eat the full activation cost for every layer
    • however, what they do is not store the full activation checkpoint of a layer on a single GPU; instead it is partitioned across all GPUs
    • this means that once the backward pass for a layer has been computed, all GPUs can free their partition of that layer’s activations
    • potentially, if the activation partitions were offloaded to CPU, you can prefetch the activations for the next layer while computing the current layer’s backward pass
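
As a rough sanity check on the numbers above, here is a minimal sketch (function name and example sizes are mine) of the per-layer activation-memory estimates from the activation-recomputation paper, assuming fp16/bf16 activations:

```python
# Rough per-layer activation memory of a transformer layer (bytes), paraphrasing
# the estimates in "Reducing Activation Recomputation in Large Transformer Models".
def activation_bytes_per_layer(s, b, h, a, t=1, tp_sp=False, selective=False):
    # s: sequence length, b: micro-batch size, h: hidden size,
    # a: attention heads, t: tensor-parallel degree
    if not tp_sp:
        return s * b * h * (34 + 5 * a * s / h)    # no TP/SP sharding of activations
    if selective:
        return 34 * s * b * h / t                  # TP + SP + selective recomputation
    return (s * b * h / t) * (34 + 5 * a * s / h)  # TP + sequence parallelism

# Illustrative GPT-3-sized layer: s=2048, b=1, h=12288, a=96, 8-way TP
full = activation_bytes_per_layer(2048, 1, 12288, 96)
lean = activation_bytes_per_layer(2048, 1, 12288, 96, t=8, tp_sp=True, selective=True)
print(f"{full / 2**30:.2f} GiB -> {lean / 2**30:.2f} GiB per layer")
```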

Model memory

  • Increasing TP and PP implicitly “shards” the model across GPUs, so it is quite memory efficient

    • Main constraint is that TP is fairly communication intensive, and thus should usually stay within the boundary of a single node, to only use intra-node communication
      • *This might become irrelevant as inter-node networking performance approaches intra-node performance*
    • Thus, the maximum granularity at which we can shard a model with TP is the number of GPUs in a node.
      • Depending on the size, a single transformer layer may not fit within a node
  • In the case of extreme size, we may have to use ZeRO-3, as it allows for arbitrarily large models.

    • For a model of Ψ parameters on N_d devices, each device only needs about 16Ψ/N_d bytes (2Ψ fp16 params + 2Ψ fp16 grads + 12Ψ fp32 Adam state, all partitioned); a back-of-the-envelope comparison is sketched after this list
  • We can use ZeRO-1 (sharding optimizer states only) with 3D parallelism

    • brings most of the memory reduction from ZeRO (per-device memory drops from 16Ψ to 4Ψ + 12Ψ/N_d, and the 12Ψ of optimizer state is the bulk of it)
    • not ZeRO-2 or ZeRO-3, because
      • gradient accumulation in PP requires that all gradients stay resident across the multiple forward/backward passes of a training iteration
      • since ZeRO stage 2 partitions the gradients, the two are simply incompatible, unfortunately
      • indeed, in PP each device accumulates the gradients corresponding to its layers across the micro-batches
      • when the pipeline is replicated across multiple groups of nodes to do DP, each pipeline needs to hold on to its gradients throughout the training iteration to be able to do the backward passes appropriately (only activation gradients are communicated across the stage boundaries during the iteration)
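
To put numbers on the model-memory bullets above, a back-of-the-envelope sketch (function names and example configuration are mine), using the usual 16-bytes-per-parameter mixed-precision Adam accounting from the ZeRO paper:

```python
# Model + optimizer memory per GPU, assuming mixed-precision Adam at
# 16 bytes/param (2 fp16 params + 2 fp16 grads + 12 bytes fp32 optimizer state).
def bytes_per_gpu_3d(n_params, tp, pp, zero1_dp=1):
    """TP * PP shards the model; ZeRO-1 additionally shards the 12 bytes/param
    of optimizer state across the data-parallel group."""
    local = n_params / (tp * pp)
    return local * (2 + 2) + local * 12 / zero1_dp

def bytes_per_gpu_zero3(n_params, dp):
    """ZeRO-3: params, grads and optimizer state are all partitioned over DP ranks."""
    return 16 * n_params / dp

psi = 175e9  # illustrative 175B-parameter model
print(f"TP=8, PP=16, ZeRO-1 over DP=8: {bytes_per_gpu_3d(psi, 8, 16, 8) / 2**30:.1f} GiB/GPU")
print(f"ZeRO-3 over 1024 GPUs:         {bytes_per_gpu_zero3(psi, 1024) / 2**30:.1f} GiB/GPU")
```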

Communication

  • Tensor parallelism together with sequence parallelism requires four all-gathers and four reduce-scatters per transformer layer in a single forward and backward pass.

    • ≈3% overhead over a normal forward + backward pass (according to the activation-recomputation paper)
    • depending on your sharding choices, TP × PP may already span all GPUs, so DP may not be needed at all
  • All-reduce communication volume for gradients in typical DP = 2Ψ per GPU (for a single training iteration)

  • With ZeRO-3, every GPU holds only a slice of the model

    • At forward time, the GPU that owns a given parameter shard broadcasts it so that every GPU can run the forward for that slice of the model
    • Communication volume with ZeRO-3 = 3Ψ per GPU per iteration, a 50% increase in communication over plain DP (see the sketch below)
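
A tiny worked example of the per-GPU communication-volume accounting above (the 2Ψ vs 3Ψ factors follow the ZeRO paper; the byte conversion assumes bf16 messages and a made-up model size):

```python
# Per-GPU communication volume per training iteration, in parameter counts
# (psi = number of model parameters), following the ZeRO paper's accounting.
def dp_gradient_allreduce(psi):
    # ring all-reduce = reduce-scatter + all-gather of the gradients ~= 2 * psi
    return 2 * psi

def zero3_total(psi):
    # parameter all-gather in forward + parameter all-gather in backward
    # + gradient reduce-scatter ~= 3 * psi, i.e. 50% more than plain DP
    return 3 * psi

psi = 175e9
bytes_per_elem = 2  # bf16
print(f"plain DP: {dp_gradient_allreduce(psi) * bytes_per_elem / 1e9:.0f} GB per GPU per iteration")
print(f"ZeRO-3:   {zero3_total(psi) * bytes_per_elem / 1e9:.0f} GB per GPU per iteration")
```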

Ease of usage

  • More obvious
    • supposedly, TP+PP will get you memory savings similar to ZeRO-3 but at a lower communication cost: the heavy TP traffic stays intra-node and PP only needs small point-to-point transfers between nodes, vs. ZeRO-3’s exchange of parameters (often across nodes) for each part of the forward

      • might become less true as inter-node bandwidth comes closer to intra-node bandwidth
    • defining a model is a lot trickier with TP+PP (see the sketch after this list)

      • less flexibility to move away from tried-and-true GPT-style architectures
      • or you pay the cost of writing the new module while keeping TP in mind
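
To make the last point concrete, here is a minimal sketch (not Megatron-LM’s actual implementation; it assumes an already-initialized torch.distributed process group) of what “keeping TP in mind” means: even a plain linear layer has to know about its tensor-parallel group.

```python
import torch
import torch.nn as nn
import torch.distributed as dist

class ColumnParallelLinear(nn.Module):
    """Sketch of a tensor-parallel linear layer: the weight's output dimension is
    split across the ranks of a TP process group, so the module must be written
    with that group in mind."""

    def __init__(self, in_features, out_features, tp_group):
        super().__init__()
        self.tp_group = tp_group
        tp_size = dist.get_world_size(tp_group)
        assert out_features % tp_size == 0, "output dim must divide across TP ranks"
        # Each rank only materializes its own shard of the weight.
        self.weight = nn.Parameter(torch.empty(out_features // tp_size, in_features))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, x):
        # The local matmul produces only this rank's slice of the output features.
        y_local = nn.functional.linear(x, self.weight)
        # If the next layer is not tensor-parallel, the slices must be gathered
        # along the feature dimension (real training code would use an
        # autograd-aware all-gather so gradients flow back correctly).
        parts = [torch.empty_like(y_local) for _ in range(dist.get_world_size(self.tp_group))]
        dist.all_gather(parts, y_local, group=self.tp_group)
        return torch.cat(parts, dim=-1)
```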

Non-obvious

  • low precision is harder to get right in 3D parallelism, because of all the reduce_scatter_sum calls you need to do when you do TP: summing partial results in bf16/fp16 loses precision unless you accumulate in fp32 (see the sketch below)
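
A self-contained illustration of that failure mode (the shapes and scales are made up): accumulating many partial results directly in bf16, as a naive reduce_scatter_sum over TP ranks effectively does, drifts measurably compared with accumulating in fp32 and down-casting at the end.

```python
import torch

torch.manual_seed(0)
# Stand-ins for the per-rank partial results a TP reduce_scatter_sum would add up.
parts = [torch.randn(4096) * 1e-3 for _ in range(64)]

acc_bf16 = torch.zeros(4096, dtype=torch.bfloat16)
for p in parts:
    acc_bf16 = acc_bf16 + p.to(torch.bfloat16)  # low-precision accumulation

acc_fp32 = torch.stack(parts).sum(dim=0)        # reference: accumulate in fp32

print("max abs drift of bf16 accumulation:",
      (acc_bf16.float() - acc_fp32).abs().max().item())
```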