Look at “Anatomy of Model’s Memory Usage” in Stas Bekman’s book.

There is a very handy GPU VRAM Estimator (https://vram.asmirnov.xyz/) by Alexander Smirnov.

The components of GPU memory usage are the following:

  1. model weights
  2. optimizer states
  3. gradients
  4. forward activations saved for gradient computation
  5. temporary buffers
  6. functionality-specific memory

Training

A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter, plus activation memory and temporary buffers.

Weights + Optimizer + Gradients

  • To train a model, you need at least 18 bytes per parameter in mixed half precision (see the sketch after this list):
    • 8 bytes for the Adam optimizer states (2 moments in fp32)
    • 6 bytes for mixed half precision weights (fp32 master copy + fp16 copy)
    • 4 bytes for gradients in fp32
  • If we do everything in bf16/fp16, that drops to 8 bytes per parameter:
    • 4 bytes for the Adam optimizer states
    • 2 bytes for model weights
    • 2 bytes for gradients
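
As a quick sanity check, the per-parameter byte counts above can be turned into a tiny calculator. This is a minimal sketch, assuming AdamW as described; the helper name and the GiB conversion are my own, not from the source:

```python
def train_state_memory_gb(n_params_in_billions: float, mixed_precision: bool = True) -> float:
    """Estimate memory (GiB) for weights + AdamW states + gradients."""
    if mixed_precision:
        # 6 (fp32 master + fp16 copy of weights) + 8 (two fp32 Adam moments) + 4 (fp32 grads)
        bytes_per_param = 6 + 8 + 4   # = 18
    else:
        # everything in bf16/fp16: 2 (weights) + 4 (two half-precision Adam moments) + 2 (grads)
        bytes_per_param = 2 + 4 + 2   # = 8
    return n_params_in_billions * 1e9 * bytes_per_param / 2**30

# e.g. a 7B-parameter model trained in mixed precision:
print(f"{train_state_memory_gb(7):.0f} GiB")   # ~117 GiB, before activations and temp buffers
```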

Forward Activations

  • The size of forward activations depends on many factors, the key ones being sequence length, hidden size, and batch size.
  • Their size scales quadratically with sequence length: we have to store the output of softmax(Q × K.T), which has shape Batch Size × Number of Attention Heads × Sequence Length², as shown in the sketch below.
  • On top of that, there are the inputs and outputs passed and returned by the forward and backward functions, and the forward activations saved for gradient computation.
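
A minimal sketch of that quadratic term alone, assuming the attention scores are kept in 2-byte fp16/bf16 for a single layer; the helper name is mine, not from the source:

```python
def attention_scores_memory_gb(batch_size: int, n_heads: int, seq_len: int,
                               bytes_per_element: int = 2) -> float:
    """Memory (GiB) of one layer's softmax(Q @ K.T) output, stored in fp16/bf16."""
    return batch_size * n_heads * seq_len ** 2 * bytes_per_element / 2**30

# e.g. batch size 8, 32 heads, 4096-token sequences:
print(f"{attention_scores_memory_gb(8, 32, 4096):.0f} GiB per layer")   # 8 GiB
```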

GPU VRAM

  • To tell how many GPUs you need in about 5 seconds:
    • Training in mixed half precision: model_size_in_B * 18 * 1.25 / gpu_size_in_GB
    • Inference in half precision: model_size_in_B * 2 * 1.25 / gpu_size_in_GB
  • That is the minimum; you need more VRAM for a bigger batch size and a longer sequence length (see the sketch after this list).
  • The 1.25 factor adds roughly 25% for activations (a very rough approximation).
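
A minimal sketch of those two rules of thumb; the helper name and the rounding up to a whole GPU count are assumptions, not from the source:

```python
import math

def gpus_needed(model_size_in_B: float, gpu_size_in_GB: float, training: bool = True) -> int:
    """Minimum GPU count: 18 bytes/param for mixed-precision training,
    2 bytes/param for half-precision inference, plus ~25% for activations."""
    bytes_per_param = 18 if training else 2
    return math.ceil(model_size_in_B * bytes_per_param * 1.25 / gpu_size_in_GB)

# e.g. a 70B model on 80 GB GPUs:
print(gpus_needed(70, 80))                   # training: 20
print(gpus_needed(70, 80, training=False))   # inference: 3
```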