In practice

  • Let’s say you have 128 GPUs with 8 GPUs per node. Let’s say 64 GPUs are required for one batch (70B model). Then you have

    • Data parallelism DP = 128 / 64 = 2.
    • We want tensor parallelism TP = 8 because we have 8 GPUs per node.
    • Thus, we have pipeline parallelism with PP = 64 / 8 = 8.
    • This means that the layers are distributed among 8 nodes, and each node distributes the tensor computation between 8 GPUs.
  • How to compute the needed number of GPUs for a batch?

    • If we do mixed-precision training with a 70B model, as explained in Memory usage (VRAM), we need num_GPUs = model_size_in_B * 18 * 1.25 / gpu_size_in_GB (see the sketch after this list).
    • For a 70B model and A100s with 80GB RAM, this gives us 19.68 GPUs
      • Could be 8.75 GPUs if using only bf16/fp16 (i.e., roughly 8 bytes per parameter in the same formula instead of 18)
    • 1.25 is a very rough estimate of the activation overhead; the actual overhead grows linearly with batch size and quadratically with sequence length.
    • We can reduce this by using Activation checkpointing in a smart manner + doing tensor and sequence parallelism.
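
A quick back-of-the-envelope sketch of the rule of thumb above and the 128-GPU split; the constants are the ones quoted in this section, nothing more authoritative than that.

```python
# Rough GPU-count estimate from the rule of thumb above, plus the DP/TP/PP split
# of the 128-GPU example. Purely illustrative arithmetic.
model_size_in_B = 70          # parameters, in billions
gpu_size_in_GB = 80           # A100 80 GB
bytes_per_param = 18          # mixed-precision training (weights + grads + Adam states)
activation_overhead = 1.25    # very rough multiplier

num_gpus = model_size_in_B * bytes_per_param * activation_overhead / gpu_size_in_GB
print(num_gpus)               # ~19.7 GPUs of memory; the example above provisions 64 per replica

total_gpus, gpus_per_node, gpus_per_replica = 128, 8, 64
tp = gpus_per_node                      # keep tensor parallelism inside a node
pp = gpus_per_replica // tp             # 8 pipeline stages
dp = total_gpus // gpus_per_replica     # 2 data-parallel replicas
print(dp, tp, pp)                       # 2 8 8
```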

How to choose your 3D Parallelism (ZeRO-3 or ZeRO-1+PP+TP)

  • Increasing TP and PP implicitly “shards” the model across GPUs, so it is quite memory efficient
    • The main constraint is that TP is fairly communication-intensive, and thus should usually stay within the boundary of a single node, so that it only uses intra-node communication
      • *This might become irrelevant as inter-node networking performance approaches intra-node performance*
    • Thus, the maximum TP degree (the finest granularity at which we can shard a single layer) is the number of GPUs in a node.
      • Depending on the model size, a single transformer layer may not fit within a node
  • In the case of extreme size, we may have to use ZeRO-3, as it allows for arbitrarily large models.
    • For a model with Ψ parameters and N devices, each device only needs to hold roughly 18Ψ/N bytes of model states (plus activations), so we just need enough devices (see the sketch below).
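
A minimal sketch of that memory argument, assuming the same 18 bytes/param figure used above and ignoring activations; the device counts are illustrative.

```python
# Per-device model-state memory under ZeRO-3: roughly 18 * psi / N bytes,
# so adding devices keeps shrinking the per-GPU footprint (activations aside).
psi = 70e9            # model parameters
bytes_per_param = 18  # mixed-precision weights + grads + Adam states

for N in (64, 128, 256, 512):
    per_gpu_gb = psi * bytes_per_param / N / 1e9
    print(f"N={N:4d}  ~{per_gpu_gb:6.1f} GB of model states per GPU")
```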

ZeRO Data Parallelism + Pipeline Parallelism + Tensor Parallelism

  • When ZeRO-DP is combined with PP (and optionally TP) it typically enables only ZeRO stage 1 (optimizer sharding).

    • At the end of the training iteration,
      • Each process sends its updated parameter partition of volume Ψ/N_d (where Ψ is the number of parameters and N_d the DP degree) to all the other processes, an all-gather ⇒ communication volume of Ψ per process
  • In DeepSpeed, Pipeline Parallelism can work with ZeRO stage 1 but not stage 2

    • This is because gradient accumulation in PP requires all gradients to remain resident across multiple forward/backward passes.
    • Since ZeRO stage 2 partitions the gradients, the two are simply incompatible, unfortunately.
    • Indeed, in PP, each device accumulates the gradients corresponding to its layers across the microbatches.
    • When replicating the pipeline across multiple groups of nodes to do DP, each pipeline needs to hold on to its gradients throughout the training iteration to be able to do the backward passes appropriately (communicating activation gradients across stage boundaries). See the minimal DeepSpeed sketch after this list.
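
A minimal, hedged sketch of what this combination looks like in DeepSpeed: ZeRO stage 1 in the config plus a PipelineModule for the stages. The layer list, stage count, batch sizes, and learning rate are placeholders, not settings from the source.

```python
# Sketch only: combine DeepSpeed pipeline parallelism with ZeRO stage 1.
# Stage 2/3 would shard gradients, which conflicts with PP's gradient accumulation.
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

layers = [nn.Linear(1024, 1024) for _ in range(16)]   # toy stand-in for transformer blocks

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,          # number of in-flight microbatches
    "zero_optimization": {"stage": 1},         # optimizer-state sharding only
    "bf16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

model = PipelineModule(layers=layers, num_stages=8)    # 8 pipeline stages (placeholder)
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
# Training then proceeds via engine.train_batch(data_iter=...) under a distributed launch.
```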

PTD-P (Megatron-LM)

  • “Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM”
  • Data parallelism + Tensor Parallelism + Interleaved 1F1B Pipeline Parallelism

Notations

β€’ (𝑝, 𝑑, 𝑑): Parallelization dimensions. 𝑝 for the pipeline-model- parallel size, 𝑑 for the tensor-model-parallel size, and 𝑑 for the data-parallel size. β€’ 𝑛: Number of GPUs. We require 𝑝 Β· 𝑑 Β· 𝑑 = 𝑛. β€’ 𝐡: Global batch size (provided as input). β€’ 𝑏: Microbatch size. β€’ π‘š = 1/𝑏 Β· 𝐡/𝑑 : Number of microbatches in a batch per pipeline

Tensor and Pipeline Model Parallelism interactions

Bubble size

  • As stated in Pipeline Parallelism, using pipeline parallelism with periodic flushes results in a pipeline bubble of size (p − 1)/m.
  • Let’s assume d = 1 (data-parallel size); consequently, t · p = n.
  • The pipeline bubble size in terms of t is then (n/t − 1)/m.
  • As t increases, the pipeline bubble thus decreases for fixed B, b, and d (m = B/(b · d) stays fixed); see the sketch after this list.
    • Indeed, the pipeline depth p = n/t decreases if we have more tensor parallelism.
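
A small sketch of how the bubble fraction (p − 1)/m = (n/t − 1)/m shrinks as t grows; the values of n, B, and b are illustrative, with d = 1 as assumed above.

```python
# Pipeline bubble fraction for increasing tensor-parallel size t, with d = 1 so p = n / t.
n, B, b, d = 64, 512, 8, 1
m = B // (b * d)                        # microbatches per pipeline = 64
for t in (1, 2, 4, 8):
    p = n // t
    bubble = (p - 1) / m                # = (n / t - 1) / m
    print(f"t={t}  p={p:2d}  bubble_fraction={bubble:.3f}")
```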

Communication

  • The amount of communication performed between different GPUs is also affected by the values of p and t.

  • Pipeline model parallelism features cheaper point-to-point communication.

    • With pipeline parallelism, the total amount of communication that needs to be performed between every pair of consecutive devices (for either the forward or backward pass) for each microbatch is b · s · h, where s is the sequence length and h is the hidden size
  • Tensor model parallelism, on the other hand, uses all-reduce communication (two all-reduce operations each in the forward and backward pass of every layer)

    • With tensor model parallelism, tensors of total size b · s · h need to be all-reduced among t model replicas twice each in the forward and backward pass for each layer (in the forward pass, once for the self-attention block and once for the MLP block), leading to a total communication of 8 · b · s · h · (t − 1)/t per layer per device for each microbatch.
    • Each device typically has multiple layers; the total amount of tensor-parallel communication per device for each microbatch is then l_stage · 8 · b · s · h · (t − 1)/t, where l_stage is the number of layers in a pipeline stage (see the sketch after this list).
  • We see that tensor model parallelism increases the amount of communication between devices.
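
A sketch comparing the two per-microbatch communication volumes just described (element counts, not bytes); the sizes s, h, t, and l_stage are illustrative.

```python
# Compare PP point-to-point traffic (b*s*h per pair of consecutive stages) with
# TP all-reduce traffic (8*b*s*h*(t-1)/t per layer per device), per microbatch.
b, s, h = 1, 2048, 8192     # microbatch size, sequence length, hidden size (illustrative)
t = 8                       # tensor-parallel degree
l_stage = 4                 # layers per pipeline stage

pp_p2p = b * s * h
tp_per_layer = 8 * b * s * h * (t - 1) / t
tp_total = l_stage * tp_per_layer
print(f"PP p2p: {pp_p2p:,} elements; TP all-reduce per device: {tp_total:,.0f} elements")
```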

Takeaway #1

  • When considering different forms of model parallelism, tensor model parallelism should generally be used up to degree g when using g-GPU servers, and then pipeline model parallelism can be used to scale up to larger models across servers.