Look at “Anatomy of Model’s Memory Usage” in Stas Bekman’s book
There is a very handy GPU VRAM Estimator (https://vram.asmirnov.xyz/) by Alexander Smirnov.
The components of GPU memory usage are the following:
- model weights
- optimizer states
- gradients
- forward activations saved for gradient computation
- temporary buffers
- functionality-specific memory
Training
A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter, plus activation memory and temporary buffers.
Weights + Optimizer + Gradients
- To train a model, you need at least 18 bytes per parameter in mixed (half) precision (see the sketch after this list):
  - 8 bytes for the AdamW optimizer states (two momenta in fp32)
  - 6 bytes for the model weights in mixed precision (4-byte fp32 master copy + 2-byte fp16 copy)
  - 4 bytes for the gradients in fp32
- If we do everything in bf16/fp16, this drops to 8 bytes per parameter:
  - 4 bytes for the Adam optimizer states (two momenta in half precision)
  - 2 bytes for the model weights
  - 2 bytes for the gradients
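A minimal Python sketch of these per-parameter byte counts; the function name and the example 7B model size are illustrative, not from the book:

```python
# Rough per-parameter memory for training (weights + optimizer + gradients only;
# activations and temp buffers come on top of this).

def training_bytes_per_param(mixed_precision: bool = True) -> int:
    if mixed_precision:
        optimizer = 8   # AdamW: two fp32 momenta (4 + 4 bytes)
        weights = 6     # fp32 master copy (4) + fp16/bf16 working copy (2)
        gradients = 4   # gradients kept in fp32
    else:
        # everything in bf16/fp16
        optimizer = 4   # two half-precision momenta (2 + 2 bytes)
        weights = 2
        gradients = 2
    return optimizer + weights + gradients


params = 7e9  # e.g. a 7B-parameter model (illustrative)
print(f"mixed precision: {training_bytes_per_param(True) * params / 2**30:.0f} GiB")
print(f"pure bf16/fp16:  {training_bytes_per_param(False) * params / 2**30:.0f} GiB")
```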
Forward Activations
- Their size depends on many factors, the key ones being sequence length, hidden size, and batch size.
- Their size scales quadratically with sequence length: we have to store the output of softmax(Q × K.T), which has shape batch_size × num_attention_heads × seq_len² (see the sketch below).
- There are also the inputs and outputs being passed to and returned by the forward and backward functions, and the forward activations saved for gradient computation.
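A rough sketch of just this quadratic attention-score term, assuming the scores are stored in fp16/bf16; the function name and example settings are illustrative, and this counts only one of the activation terms:

```python
# Size of the stored softmax(Q @ K.T) output per layer:
# shape (batch_size, num_heads, seq_len, seq_len).

def attention_scores_bytes(batch_size: int, num_heads: int,
                           seq_len: int, bytes_per_elem: int = 2) -> int:
    # one attention-score tensor per layer, in fp16/bf16 by default
    return batch_size * num_heads * seq_len * seq_len * bytes_per_elem


# e.g. a 7B-class model: 32 heads, 32 layers, batch 1, 4k sequence length
per_layer = attention_scores_bytes(batch_size=1, num_heads=32, seq_len=4096)
print(f"per layer: {per_layer / 2**20:.0f} MiB, "
      f"all 32 layers: {32 * per_layer / 2**30:.1f} GiB")
```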
GPU VRAM
- Tell how many GPUs you need in 5 seconds (see the sketch below):
  - Training in half mixed-precision: model_size_in_B * 18 * 1.25 / gpu_size_in_GB
  - Inference in half precision: model_size_in_B * 2 * 1.25 / gpu_size_in_GB
- That’s the minimum; you need more for a bigger batch size and a longer sequence length.
- The factor of 1.25 adds ~25% for activations (very approximate).
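The same rule of thumb as a small Python helper; the function name and the example numbers are illustrative, only the 18/2 bytes-per-parameter and the 1.25 activation factor come from the note above:

```python
import math

def gpus_needed(model_size_in_b: float, gpu_size_in_gb: float,
                training: bool = True) -> int:
    # 18 bytes/param for mixed-precision training, 2 bytes/param for fp16/bf16 inference
    bytes_per_param = 18 if training else 2
    needed_gb = model_size_in_b * bytes_per_param * 1.25  # +25% for activations
    return math.ceil(needed_gb / gpu_size_in_gb)


# e.g. a 70B model on 80 GB GPUs (illustrative numbers)
print(gpus_needed(70, 80, training=True))    # -> 20
print(gpus_needed(70, 80, training=False))   # -> 3
```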