In practice

Let's say you have 128 GPUs with 8 GPUs per node, and that 64 GPUs are required for one batch (70B model). Then you have:
 Data parallelism $d = 2$.
 We want tensor parallelism $t = 8$ because we have 8 GPUs per node.
 Thus, we have pipeline parallelism with $p = 128/(8 \cdot 2) = 8$.
 This means that the layers are distributed among 8 nodes, and each node distributes the tensor computation between its 8 GPUs.
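The arithmetic above can be sketched as follows (a minimal check using the example's numbers; variable names are illustrative):

```python
# Example from above: 128 GPUs, 8 GPUs per node, 64 GPUs per model replica.
n_gpus = 128
gpus_per_node = 8
gpus_per_replica = 64  # what the 70B model needs for one batch

t = gpus_per_node          # tensor parallelism stays within a node
p = gpus_per_replica // t  # pipeline stages: 64 / 8 = 8
d = n_gpus // (t * p)      # remaining factor is data parallelism: 128 / 64 = 2

assert t * p * d == n_gpus
print(f"t={t}, p={p}, d={d}")  # t=8, p=8, d=2
```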

How to compute the needed number of GPUs for a batch?
 If we do mixed-precision training with a 70B model, as explained in Memory usage (VRAM), we need
num_GPUs = model_size_in_B * 18 * 1.25 / gpu_size_in_GB
 For a 70B model and A100s with 80GB RAM, this gives us ~19.7 GPUs
 Could be 8.75 GPUs if using only bf16/fp16
 1.25 is a very rough estimate for the activation overhead; the actual overhead depends linearly on batch size and quadratically on sequence length.
 We can reduce this by using activation checkpointing in a smart manner, and by doing tensor and sequence parallelism.
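As a sketch of the estimate above (the function name and default arguments are illustrative, not a real API):

```python
def gpus_needed(model_size_in_B, gpu_size_in_GB,
                bytes_per_param=18, activation_overhead=1.25):
    """Rough estimate of GPUs needed to train a model.

    18 bytes/param corresponds to mixed-precision training (weights,
    gradients, and optimizer states); 1.25 is the crude activation factor.
    """
    return model_size_in_B * bytes_per_param * activation_overhead / gpu_size_in_GB

print(gpus_needed(70, 80))                     # ~19.7 (mixed precision)
print(gpus_needed(70, 80, bytes_per_param=8))  # 8.75 (bf16/fp16 only)
```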
How to choose your 3D parallelism (ZeRO-3 or ZeRO-1 + PP + TP)
 Increasing TP and PP implicitly "shards" the model across GPUs, and is thus quite memory efficient
 The main constraint is that TP is fairly communication-intensive, and thus should usually stay within the boundary of a single node, so that it only uses intra-node communication
 *This might become irrelevant as inter-node networking performance approaches intra-node performance*
 Thus, the maximum granularity at which we can shard a model this way is the number of GPUs in a node.
 Depending on the size, a single transformer layer may not fit within a node
 In the case of extreme size, we may have to use ZeRO-3, as it allows for arbitrarily large models.
 For a model of size $\Psi$ and $N_d$ devices, we just need $\frac{16\Psi}{N_d} < \text{GPU RAM}$
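A quick check of this ZeRO-3 constraint (the helper name is hypothetical; 16 bytes/param assumes mixed-precision training state, and activation memory is ignored):

```python
def fits_with_zero3(model_size_in_B, n_devices, gpu_ram_in_GB):
    # ZeRO-3 shards the ~16 bytes/param of mixed-precision training
    # state evenly across all devices.
    return 16 * model_size_in_B / n_devices < gpu_ram_in_GB

print(fits_with_zero3(70, 16, 80))  # True: 70 GB of sharded state per GPU
print(fits_with_zero3(70, 8, 80))   # False: 140 GB per GPU
```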
# ZeRO Data Parallelism + Pipeline Parallelism + Tensor Parallelism

When ZeRO-DP is combined with PP (and optionally TP), it typically enables only ZeRO stage 1 (optimizer sharding).
 At the end of the training iteration, each process sends parameters of volume $\frac{\Psi}{N_d}$ to all the $N_d$ processes → $\Psi$ volume in total

In DeepSpeed, Pipeline Parallelism can work with ZeRO stage 1 but not stage 2.
 This is due to the gradient accumulation in PP that requires that all gradients be present across multiple forward/backward passes.
 Since ZeRO stage 2 partitions the gradients, the two are simply incompatible, unfortunately.
 Indeed, in PP, each device accumulates the gradients corresponding to its layers across the $m$ microbatches.
 When replicating the pipeline across multiple clusters of nodes to do DP, each pipeline needs to hold on to its gradients throughout the training iteration to be able to do the backward passes appropriately (communicating the gradients at the stage boundaries).
PTD-P (Megatron-LM)
 "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM"
 Data parallelism + Tensor Parallelism + Interleaved 1F1B Pipeline Parallelism
Notations
 $(p, t, d)$: parallelization dimensions. $p$ is the pipeline-model-parallel size, $t$ the tensor-model-parallel size, and $d$ the data-parallel size.
 $n$: number of GPUs. We require $p \cdot t \cdot d = n$.
 $B$: global batch size (provided as input).
 $b$: microbatch size.
 $m = \frac{1}{b} \cdot \frac{B}{d}$: number of microbatches in a batch per pipeline.
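With hypothetical values, the microbatch count $m$ works out as:

```python
# Hypothetical values illustrating the notation above.
B, b, d = 1024, 2, 4  # global batch size, microbatch size, data parallelism
m = B // (b * d)      # microbatches per pipeline per batch
print(m)  # 128
```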
Tensor and Pipeline Model Parallelism interactions
Bubble size
 As stated in Pipeline Parallelism, using pipeline parallelism with periodic flushes results in a pipeline bubble of size $\frac{p-1}{m}$
 Let's assume $d=1$ (no data parallelism); consequently, $t \cdot p = n$
 The pipeline bubble size in terms of $t$ is: $\frac{p-1}{m} = \frac{n/t - 1}{m}$
 As $t$ increases, the pipeline bubble thus decreases for fixed $B$, $b$, and $d$
 Indeed, the pipeline depth decreases if we use more tensor parallelism.
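A quick numerical illustration of this effect, with hypothetical $n$ and $m$:

```python
# Bubble fraction (p - 1) / m: for fixed n, a larger t means a shallower
# pipeline (p = n / t) and hence a smaller bubble.
n, m = 64, 16  # hypothetical: 64 GPUs per replica, 16 microbatches
bubble = {t: (n // t - 1) / m for t in (1, 2, 4, 8)}
print(bubble)  # {1: 3.9375, 2: 1.9375, 4: 0.9375, 8: 0.4375}
```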
Communication

The amount of communication performed between different GPUs is also affected by the values of $p$ and $t$.

Pipeline model parallelism features cheaper point-to-point communication.
 With pipeline parallelism, the total amount of communication that needs to be performed between every pair of consecutive devices (for either the forward or backward pass) for each microbatch is $bsh$, where $s$ is the sequence length and $h$ is the hidden size

Tensor model parallelism, on the other hand, uses all-reduce communication (two all-reduce operations each in the forward and backward pass).
 With tensor model parallelism, tensors of total size $bsh$ need to be all-reduced among $t$ model replicas twice each in the forward and backward pass for each layer (one all-reduce for the MLP block and one for self-attention), leading to a total communication of $8bsh\left(\frac{t-1}{t}\right)$ per layer per device for each microbatch.
 Each device typically has multiple layers; the total amount of tensor-parallel communication per device for each microbatch is then $l_{stage} \cdot 8bsh\left(\frac{t-1}{t}\right)$, where $l_{stage}$ is the number of layers in a pipeline stage

We see that tensor model parallelism increases the amount of communication between devices.
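A rough sanity check of the two formulas, with hypothetical sizes (counting communicated elements per microbatch):

```python
# Hypothetical sizes, counting communicated elements per microbatch.
b, s, h = 2, 2048, 8192  # microbatch size, sequence length, hidden size
t, l_stage = 8, 4        # tensor-parallel degree, layers per pipeline stage

pp_point_to_point = b * s * h                          # between consecutive stages
tp_all_reduce = l_stage * 8 * b * s * h * (t - 1) / t  # per device, all stage layers

print(tp_all_reduce / pp_point_to_point)  # 28.0
```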
Takeaway #1
 When considering different forms of model parallelism, tensor model parallelism should generally be used up to degree $g$ when using $g$-GPU servers, and then pipeline model parallelism can be used to scale up to larger models across servers.