• Slightly different flavour of Model Parallelism
  • The main idea is to split one minibatch into multiple microbatches such that
    • every worker starts processing the next microbatch as soon as it has sent the activations (forward) / gradients (backward) of the previous microbatch to the next worker
  • How these passes are scheduled and how the gradients are aggregated vary in different approaches.
  • The number of partitions (workers) is also known as pipeline depth.
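
As a minimal illustration of the splitting step (a sketch; `split_minibatch`, its arguments, and the "samples" as plain list elements are my own choices):

```python
def split_minibatch(minibatch, num_microbatches):
    """Split a minibatch (here: a plain list of samples) into equally sized microbatches."""
    assert len(minibatch) % num_microbatches == 0, "minibatch must split evenly"
    size = len(minibatch) // num_microbatches
    return [minibatch[i * size:(i + 1) * size] for i in range(num_microbatches)]

# A minibatch of 8 samples split into 4 microbatches of 2 samples each;
# each worker pipelines these instead of processing the whole minibatch at once.
print(split_minibatch(list(range(8)), 4))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```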

Counting bubbles

  • In general, the more workers, the more bubbles; the more microbatches, the fewer bubbles.

  • Pipeline bubble size / Bubble time fraction = $\frac{\text{true pipeline time} - \text{ideal pipeline time}}{\text{ideal pipeline time}}$

    • The true pipeline time is the actual time it takes to get all the gradients i.e. do a full training iteration
    • The ideal pipeline time is the time to do one forward + backward on a microbatch, multiplied by the number of microbatches
    • The pipeline bubble size is not the number of empty spaces in a diagram

Notation

  • Let $p$ be the number of devices in the pipeline, and $m$ the number of microbatches.
  • Let $t_{id}$ be the ideal time per iteration (assuming ideal scaling), and denote the time to execute a single microbatch's forward and backward pass as $t_f$ and $t_b$, respectively.
  • Let $t_{pb}$ be the total bubble (idle) time per iteration.

Computation

  • We have $t_{id} = m \cdot (t_f + t_b)$
  • In GPipe and 1F1B, the last device must wait for $p - 1$ forward passes before processing its first microbatch, and the first device must wait for $p - 1$ backward passes before it can do its last backward pass.
    • Thus, $t_{pb} = (p - 1) \cdot (t_f + t_b)$
  • The pipeline bubble size is $\frac{t_{pb}}{t_{id}} = \frac{p - 1}{m}$
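
A small sketch that plugs these definitions into code (symbols as in the Notation section; the function name and unit times are mine):

```python
def bubble_time_fraction(p, m, t_f=1.0, t_b=1.0):
    """Pipeline bubble size t_pb / t_id = (p - 1) / m."""
    t_id = m * (t_f + t_b)         # ideal time: m microbatches with no idling
    t_pb = (p - 1) * (t_f + t_b)   # p-1 forwards at the start + p-1 backwards at the end
    return t_pb / t_id

# More devices -> bigger bubble; more microbatches -> smaller bubble.
print(bubble_time_fraction(p=4, m=4))   # 0.75
print(bubble_time_fraction(p=4, m=16))  # 0.1875
```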

GPipe

  • In GPipe (Huang et al. 2019), gradients from multiple microbatches are aggregated and applied synchronously at the end. The synchronous gradient descent guarantees learning consistency and efficiency irrespective of the number of workers.

    • Diagram: 4 workers, 4 microbatches
    • Fraction of bubbles: given $m$ evenly split microbatches and $p$ partitions, assuming both the forward and the backward pass per microbatch take one unit of time, the fraction of the schedule spent in bubbles is $1 - \frac{2mp}{2p\,(m + p - 1)} = \frac{p - 1}{m + p - 1}$ (measured against the total schedule; relative to the ideal time $t_{id}$ this is the $\frac{p - 1}{m}$ from above; see the simulation sketch after this list).
    • The GPipe paper observed that the bubble overhead is almost negligible if the number of microbatches is more than 4x the number of partitions (when Activation checkpointing is applied).
    • Almost linear speedup in throughput with the number of devices, although it is not always guaranteed if the model parameters are not evenly distributed across workers.
  • For a large number of microbatches, this approach has a high memory footprint, as it requires cached intermediate activations (or just input activations when using activation recomputation) to be kept in memory for all microbatches through the lifetime of a training iteration.
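
A rough simulation (my own sketch, assuming unit-time forward and backward passes) that lays out the GPipe schedule on a device-by-time grid and counts the idle slots, reproducing the $\frac{p - 1}{m + p - 1}$ fraction quoted above:

```python
def gpipe_bubble_fraction(p, m):
    """Lay out the GPipe schedule on a (device x time-slot) grid with unit-time
    forward/backward passes and return the fraction of slots that are idle."""
    total_slots = 2 * (m + p - 1)           # m+p-1 forward slots, then m+p-1 backward slots
    grid = [["idle"] * total_slots for _ in range(p)]
    for i in range(m):                      # forward: microbatch i reaches stage s at slot s + i
        for s in range(p):
            grid[s][s + i] = ("F", i)
    offset = m + p - 1                      # backward phase begins after the last forward slot
    for i in range(m):                      # backward flows from the last stage back to stage 0
        for s in range(p):
            grid[s][offset + (p - 1 - s) + i] = ("B", i)
    idle = sum(row.count("idle") for row in grid)
    return idle / (p * total_slots)

# 4 workers, 4 microbatches: (p-1)/(m+p-1) = 3/7 of the schedule is bubble.
print(gpipe_bubble_fraction(p=4, m=4))   # 0.428...
# With m = 4p the bubble fraction drops to 3/19, about 0.16.
print(gpipe_bubble_fraction(p=4, m=16))  # 0.157...
```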

PipeDream (1F1B)

  • Qualitatively better than GPipe: the same number of bubbles, but the memory footprint scales with the depth of the pipeline instead of the number of microbatches

  • PipeDream (Narayanan et al. 2019) schedules each worker to alternately process forward and backward passes (1F1B)

    • Same number of bubbles as GPipe, but this schedule limits the number of in-flight microbatches (the number of microbatches for which the backward pass is outstanding and whose activations need to be kept in memory) to the depth of the pipeline, instead of the number of microbatches (see the sketch after the diagram below)
  • PipeDream calls each model partition a “stage”, and each stage worker can have multiple replicas to run data parallelism

  • Diagram

    • PipeDream uses a deterministic round-robin load balancing strategy to assign work among multiple replicas of stages to ensure that the forward and backward passes for the same minibatch happen on the same replica (otherwise communication is needed)
      • The number represents the microbatch currently being processed
      • Note the diagonal from top-left to bottom-right: the warm-up forward passes are done back to back, and after that, instead of idling after a forward pass, each stage does a backward pass; this is why it is called 1F1B (one forward, one backward)
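
A sketch (my own, not PipeDream's actual scheduler) of the per-stage operation order under a 1F1B schedule; it shows that the number of in-flight microbatches is bounded by the pipeline depth rather than by the number of microbatches:

```python
def one_f_one_b_schedule(stage, p, m):
    """Return the ordered list of ops ('F', i) / ('B', i) executed by one pipeline
    stage (0-indexed) under a non-interleaved 1F1B schedule with a flush."""
    warmup = min(p - stage - 1, m)     # forwards done before the first backward
    ops, next_f, next_b = [], 0, 0
    for _ in range(warmup):            # warm-up phase: forward only
        ops.append(("F", next_f)); next_f += 1
    while next_f < m:                  # steady state: one forward, one backward
        ops.append(("F", next_f)); next_f += 1
        ops.append(("B", next_b)); next_b += 1
    while next_b < m:                  # cool-down: drain the remaining backwards
        ops.append(("B", next_b)); next_b += 1
    return ops

def max_in_flight(ops):
    """Max number of microbatches whose forward is done but backward is not
    (i.e. whose activations must be kept in memory)."""
    in_flight = peak = 0
    for kind, _ in ops:
        in_flight += 1 if kind == "F" else -1
        peak = max(peak, in_flight)
    return peak

p, m = 4, 16
for s in range(p):
    print(s, max_in_flight(one_f_one_b_schedule(s, p, m)))  # at most p, never m
```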

How to split layers?

  • At the beginning of a training run, PipeDream first profiles the computation time and memory cost of each layer in the model, and then optimizes a solution for partitioning layers into stages, which is a dynamic programming problem.
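
PipeDream's real partitioner also accounts for communication and stage replication; the sketch below only shows the core dynamic-programming idea, splitting layers into contiguous stages so that the slowest stage is as cheap as possible (function names and cost numbers are made up):

```python
from functools import lru_cache

def partition_layers(layer_costs, num_stages):
    """Split layers (given per-layer compute costs) into `num_stages` contiguous
    stages, minimizing the cost of the most expensive stage (the pipeline's
    throughput bottleneck). Returns (bottleneck_cost, list of stages)."""
    n = len(layer_costs)
    prefix = [0]
    for c in layer_costs:
        prefix.append(prefix[-1] + c)

    @lru_cache(maxsize=None)
    def best(i, k):
        # best(i, k): minimal bottleneck for layers i..n-1 split into k stages
        if k == 1:
            return prefix[n] - prefix[i], [list(range(i, n))]
        best_cost, best_split = float("inf"), None
        for j in range(i + 1, n - k + 2):          # first stage = layers i..j-1
            first = prefix[j] - prefix[i]
            rest_cost, rest_split = best(j, k - 1)
            cost = max(first, rest_cost)
            if cost < best_cost:
                best_cost, best_split = cost, [list(range(i, j))] + rest_split
        return best_cost, best_split

    return best(0, num_stages)

# Uneven per-layer costs, 8 layers into 4 stages; the bottleneck cost is 6
# (layer index 5 alone already costs 6, so no partition can do better).
print(partition_layers([4, 2, 2, 1, 1, 6, 3, 1], 4))
```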

PipeDream does not have an end-of-batch global gradient sync across all the workers

  • A naive implementation of 1F1B can easily lead to the forward and backward passes of one microbatch using different versions of model weights, thus lowering the learning efficiency

  • Possible Designs to tackle the issue:

    • (OLD) Weight stashing: each worker keeps track of several model versions and makes sure that the same version of weights is used in the forward and backward passes for a given data batch (EXPENSIVE MEMORY-WISE)

    • (OLD) Vertical sync (Optional): The version of model weights flows between stage workers together with activations and gradients. Then the computation adopts the corresponding stashed version propagated from the previous worker. This process keeps version consistency across workers. Note that it is asynchronous, different from GPipe. (EXPENSIVE COMMUNICATION WISE)

    • (NEW) PipeDream-flush adds a globally synchronized pipeline flush periodically, just like GPipe. (NEED TO DEFINE WHEN TO SYNCHRONIZE, usually after a full batch) In this way, it greatly reduces the memory footprint (i.e. only maintain a single version of model weights) by sacrificing a little throughput.

    • (NEW) PipeDream-2BW maintains only two versions of model weights, where “2BW” is short for “double-buffered weights”. It generates a new model version every $k$ microbatches, and $k$ should be larger than the pipeline depth $p$ ($k > p$). A newly updated model version cannot fully replace the old version immediately, since some leftover backward passes still depend on the old version. In total only two versions need to be saved, so the memory cost is much reduced.
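
A toy sketch of the double-buffering idea (my own illustration, not PipeDream-2BW's implementation): each in-flight microbatch remembers which weight version its forward pass used, the backward pass reuses that version, and an old version is dropped only once no in-flight microbatch still needs it:

```python
class DoubleBufferedWeights:
    """Toy version manager in the spirit of PipeDream-2BW: at most two weight
    versions coexist, and a backward pass always sees the weights its forward used."""

    def __init__(self, initial_weights):
        self.versions = {0: initial_weights}   # version id -> weights
        self.latest = 0
        self.in_flight = {}                    # microbatch id -> version used by its forward

    def forward(self, microbatch_id):
        self.in_flight[microbatch_id] = self.latest
        return self.versions[self.latest]      # forward always uses the newest version

    def backward(self, microbatch_id):
        version = self.in_flight.pop(microbatch_id)
        weights = self.versions[version]       # backward reuses the forward's version
        if version != self.latest and version not in self.in_flight.values():
            del self.versions[version]         # old version no longer needed by anyone
        return weights

    def update(self, new_weights):
        # Called once every k microbatches (k > pipeline depth). The previous version
        # is kept around until its leftover backward passes have drained.
        old = self.latest
        self.latest += 1
        self.versions[self.latest] = new_weights
        if old not in self.in_flight.values():
            del self.versions[old]

buf = DoubleBufferedWeights("w0")
buf.forward(0); buf.forward(1)                 # two microbatches in flight on version 0
buf.update("w1")                               # new version, but "w0" must be stashed
buf.forward(2)                                 # uses "w1"
print(buf.backward(0), buf.backward(1), buf.backward(2))  # w0 w0 w1
print(buf.versions)                            # only {1: 'w1'} remains
```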

Interleaved 1F1B (used by Megatron-LM)

Explanation

  • Extension of 1F1B to reduce the size of the pipeline bubble.

    • They use PipeDream-Flush with a twist
  • Each device can perform computation for multiple subsets of layers (called a model chunk), instead of a single contiguous set of layers.

    • For example, if each device had 4 layers before (i.e., device 1 had layers 1 − 4, device 2 had layers 5 − 8, and so on)
    • we could have each device perform computation for two model chunks (each with 2 layers), i.e., device 1 has layers 1, 2, 9, 10; device 2 has layers 3, 4, 11, 12; and so on.
    • With this scheme, each device is spread evenly over the compute graph, which reduces bubbles (a layer-assignment sketch follows this list).
  • This schedule requires the number of microbatches in a batch to be an integer multiple of the degree of pipeline parallelism (number of devices in the pipeline)

    • For example, with 4 devices, the number of microbatches in a batch must be a multiple of 4.
  • 1F1B to Interleaved 1F1B pipeline schedules

    • Number of chunks per device is 2. Dark colors show the first chunk and light colors show the second chunk. The numbers indicate the microbatch number.
    • Total number of microbatches is 8 (that’s when the flush happens.)
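
A small sketch of the chunked layer-to-device assignment described above (my own helper, using 1-indexed layers as in the example):

```python
def assign_layers(num_layers, num_devices, chunks_per_device):
    """Assign contiguous blocks of layers round-robin so that each device owns
    `chunks_per_device` non-contiguous model chunks (1-indexed layers)."""
    assert num_layers % (num_devices * chunks_per_device) == 0
    block = num_layers // (num_devices * chunks_per_device)
    assignment = {d: [] for d in range(1, num_devices + 1)}
    layer = 1
    for _ in range(chunks_per_device):          # sweep the pipeline once per chunk
        for d in range(1, num_devices + 1):
            assignment[d].extend(range(layer, layer + block))
            layer += block
    return assignment

# 16 layers, 4 devices, 2 chunks each: device 1 -> [1, 2, 9, 10], device 2 -> [3, 4, 11, 12], ...
print(assign_layers(16, 4, 2))
```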

Performance

  • Compared to PipeDream-Flush, the pipeline flush happens sooner

  • Let $v$ be the number of model chunks within a single device

    • i.e. if device 1 has layers 1-2 and layers 9-10, then $v = 2$
    • Now the forward and backward time for a microbatch of a single chunk are $t_f / v$ and $t_b / v$, compared to $t_f$ and $t_b$ in the usual 1F1B.
  • **Fewer bubbles:** the new schedule reduces the bubble time by a factor of $v$, i.e. $t^{int}_{pb} = \frac{(p - 1) \cdot (t_f + t_b)}{v}$, so the bubble time fraction becomes $\frac{1}{v} \cdot \frac{p - 1}{m}$

  • **More communication:** the amount of communication also increases by a factor of $v$.
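
Plugging the chunk count $v$ into the earlier bubble formula (a self-contained sketch with made-up numbers):

```python
def interleaved_bubble_fraction(p, m, v, t_f=1.0, t_b=1.0):
    """Bubble time fraction for interleaved 1F1B: (1/v) * (p - 1) / m."""
    t_pb = (p - 1) * (t_f + t_b) / v   # each warm-up/cool-down step now covers only 1/v of the layers
    t_id = m * (t_f + t_b)             # ideal compute time is unchanged
    return t_pb / t_id

print(interleaved_bubble_fraction(p=4, m=8, v=1))  # 0.375  (plain 1F1B)
print(interleaved_bubble_fraction(p=4, m=8, v=2))  # 0.1875 (bubble halved, communication doubled)
```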