Look at “Anatomy of Model’s Memory Usage” in Stas Bekman’s book.

There is a very handy GPU VRAM Estimator (https://vram.asmirnov.xyz/) by Alexander Smirnov.

The components of GPU memory usage are the following:

  1. model weights
  2. optimizer states
  3. gradients
  4. forward activations saved for gradient computation
  5. temporary buffers
  6. functionality-specific memory

Training

A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter, plus activation memory and temporary buffers.

Weights + Optimizer + Gradients

  • To train a model, you need at least 18 bytes per parameter in mixed half precision (see the sketch after this list):
    • 8 bytes for the Adam optimizer states (2 moments in fp32)
    • 6 bytes for mixed half precision weights (fp32 master copy + fp16 copy)
    • 4 bytes for gradients in fp32
  • If we do everything in bf16/fp16, that drops to 8 bytes per parameter:
    • 4 bytes for the Adam optimizer states
    • 2 bytes for model weights
    • 2 bytes for gradients
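
As a quick sanity check, the per-parameter byte counts above can be turned into a tiny calculator. This is a minimal sketch, assuming AdamW as described; the helper name and the GiB conversion are my own, not from the source:

```python
def train_state_memory_gb(n_params_in_billions: float, mixed_precision: bool = True) -> float:
    """Estimate memory (GiB) for weights + AdamW states + gradients."""
    if mixed_precision:
        # 6 (fp32 master + fp16 copy of weights) + 8 (two fp32 Adam moments) + 4 (fp32 grads)
        bytes_per_param = 6 + 8 + 4   # = 18
    else:
        # everything in bf16/fp16: 2 (weights) + 4 (two half-precision Adam moments) + 2 (grads)
        bytes_per_param = 2 + 4 + 2   # = 8
    return n_params_in_billions * 1e9 * bytes_per_param / 2**30

# e.g. a 7B-parameter model trained in mixed precision:
print(f"{train_state_memory_gb(7):.0f} GiB")   # ~117 GiB, before activations and temp buffers
```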

Forward Activations

  • The size of forward activations depends on many factors, the key ones being sequence length, hidden size, and batch size.
  • Their size scales quadratically with sequence length: we have to store the output of softmax(Q × K.T), which has shape Batch Size × Number of Attention Heads × Sequence Length², as shown in the sketch below.
  • On top of that, there are the inputs and outputs passed and returned by the forward and backward functions, and the forward activations saved for gradient computation.
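
A minimal sketch of that quadratic term alone, assuming the attention scores are kept in 2-byte fp16/bf16 for a single layer; the helper name is mine, not from the source:

```python
def attention_scores_memory_gb(batch_size: int, n_heads: int, seq_len: int,
                               bytes_per_element: int = 2) -> float:
    """Memory (GiB) of one layer's softmax(Q @ K.T) output, stored in fp16/bf16."""
    return batch_size * n_heads * seq_len ** 2 * bytes_per_element / 2**30

# e.g. batch size 8, 32 heads, 4096-token sequences:
print(f"{attention_scores_memory_gb(8, 32, 4096):.0f} GiB per layer")   # 8 GiB
```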

GPU VRAM

  • To tell how many GPUs you need in about 5 seconds:
    • Training in mixed half precision: model_size_in_B * 18 * 1.25 / gpu_size_in_GB
    • Inference in half precision: model_size_in_B * 2 * 1.25 / gpu_size_in_GB
  • That is the minimum; you need more VRAM for a bigger batch size and a longer sequence length (see the sketch after this list).
  • The 1.25 factor adds roughly 25% for activations (a very rough approximation).
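
A minimal sketch of those two rules of thumb; the helper name and the rounding up to a whole GPU count are assumptions, not from the source:

```python
import math

def gpus_needed(model_size_in_B: float, gpu_size_in_GB: float, training: bool = True) -> int:
    """Minimum GPU count: 18 bytes/param for mixed-precision training,
    2 bytes/param for half-precision inference, plus ~25% for activations."""
    bytes_per_param = 18 if training else 2
    return math.ceil(model_size_in_B * bytes_per_param * 1.25 / gpu_size_in_GB)

# e.g. a 70B model on 80 GB GPUs:
print(gpus_needed(70, 80))                   # training: 20
print(gpus_needed(70, 80, training=False))   # inference: 3
```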