Tell how many GPUs you need in 5 secs

• Training in half-precision mixed precision: model_size_in_B * 18 * 1.25 / gpu_size_in_GB
• Inference in half precision: model_size_in_B * 2 * 1.25 / gpu_size_in_GB

That's the minimum; you need more for a bigger batch size and a longer sequence length.

Here is the breakdown:

• Training: 8 bytes for AdamW states, 4 bytes for grads, 4+2 bytes for weights (fp32 master copy + fp16 copy)
• Inference: 2 bytes for weights (1 byte if you use quantization)
• 1.25 adds ~25% for activations (very, very approximate)

For example, let's take an 80B-param model and 80GB GPUs and calculate how many of them we will need for:

• Training: at least 23 GPUs (80 * 18 * 1.25 / 80 = 22.5)
• Inference: at least 3 GPUs (80 * 2 * 1.25 / 80 = 2.5)
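The two formulas above can be sketched as a small helper; `gpus_needed` is a hypothetical name, and the constants come straight from the breakdown (18 or 2 bytes per parameter, plus the ~25% activation headroom):

```python
import math

def gpus_needed(model_size_in_B, gpu_size_in_GB, mode="training"):
    # bytes per parameter: 18 for mixed-precision training
    # (8 AdamW states + 4 grads + 4+2 weights), 2 for half-precision inference
    bytes_per_param = 18 if mode == "training" else 2
    # 1.25 adds ~25% headroom for activations (very approximate)
    total_GB = model_size_in_B * bytes_per_param * 1.25
    # round up: you can't allocate a fraction of a GPU
    return math.ceil(total_GB / gpu_size_in_GB)

print(gpus_needed(80, 80, "training"))   # → 23
print(gpus_needed(80, 80, "inference"))  # → 3
```

Remember this is a floor: a bigger batch size or longer sequence length pushes the activation share well past 25%.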