TFLOPS

  • how many floating point operations the hardware can compute per second (TFLOPS)

  • BF16 A100 80GB ~ 312 TFLOPS

  • Generally expect to get ~50% of the advertised peak TFLOPS in a real-world cluster

  • So a general rule of thumb when preparing for a massive model training: ask around what’s the top TFLOPS one can expect to get with a given accelerator on a multi-node setup at the specified precision, and optimize until you get close to that. Once you do, stop optimizing and start training.

  • Modern machine learning accelerators all have hardware specialized for matrix-multiplication, such as Nvidia’s “Tensor Cores”

    • if you aren’t doing matrix multiplication, you’ll only be able to achieve 19.5 TFLOPS instead of the stated 312 (see the measurement sketch below)
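
Here is a minimal sketch, assuming PyTorch and a CUDA GPU, of how one might measure the BF16 matmul TFLOPS actually achieved and compare it against the advertised peak (312 on an A100):

```python
import time
import torch

# Measure achieved BF16 matmul throughput on the current GPU
n = 8192
a = torch.randn(n, n, dtype=torch.bfloat16, device="cuda")
b = torch.randn(n, n, dtype=torch.bfloat16, device="cuda")

# warm up so one-time kernel selection overhead isn't measured
for _ in range(10):
    a @ b
torch.cuda.synchronize()

iters = 100
start = time.time()
for _ in range(iters):
    a @ b
torch.cuda.synchronize()
elapsed = time.time() - start

flop_per_matmul = 2 * n**3   # n^2 output elements, each a length-n multiply-add
achieved_tflops = flop_per_matmul * iters / elapsed / 1e12
print(f"achieved: {achieved_tflops:.1f} TFLOPS (advertised BF16 peak on A100: 312)")
```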
Calculating TFLOPS
  • When calculating TFLOPS it’s important to remember that the math is different if gradient checkpointing is enabled, since when it’s activated more compute is used and it needs to be taken into account. Usually the cost is an additional forward pass, but recently better methods have been found that save some of that recomputation.
  • For decoder transformer models the following is an estimation formula which slightly under-reports the real TFLOPS (a code version follows this list):
TFLOPS: model_size_in_B * 4 * 2 * seqlen * global_batch_size / (time_in_sec_per_iteration * total_gpus * 1e3)
  • The factor of 4 is used with activation/gradient checkpointing, otherwise it will be 3. For 100B+ models, activation checkpointing will almost always be on.
  • The exact formula is in Equation 3 of Section 5.1 of the Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM paper
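
The estimation formula translates directly into a short Python helper; this is just a sketch with illustrative names, where checkpoint_factor is 4 with activation/gradient checkpointing enabled and 3 otherwise:

```python
def estimate_tflops(model_size_in_B, seqlen, global_batch_size,
                    time_in_sec_per_iteration, total_gpus,
                    checkpoint_factor=4):
    """Per-GPU TFLOPS estimate; checkpoint_factor=4 with activation checkpointing, 3 without."""
    return (model_size_in_B * checkpoint_factor * 2 * seqlen * global_batch_size
            / (time_in_sec_per_iteration * total_gpus * 1e3))

# e.g. a 100B-parameter model, seqlen 2048, global batch size 2048,
# 256 GPUs and ~90 secs per iteration, with activation checkpointing on:
print(f"{estimate_tflops(100, 2048, 2048, 90, 256):.1f} TFLOPS")  # ~145.6
```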

Model Flops Utilization (MFU)

As mentioned in the previous section, some (most?) vendors publish unrealistic peak performance TFLOPS numbers that aren’t achievable in practice. Model Flops Utilization (MFU) is the metric that tells us how well the accelerator is actually utilized. Here is how it is calculated:

  1. Measure the actual TFLOPS by calculating how many floating point operations a single training iteration takes and dividing that number by the number of seconds this iteration took.
  2. Divide the actual TFLOPS by the advertised TFLOPS to get the MFU.

Example: Let’s say you’re training in BFLOAT16 precision:

  • If a single iteration requires 624 tera floating point operations and it took 4 secs to run, then we get 624/4=156 actual TFLOPS
  • BF16@A100 is advertised as 312 TFLOPS, so 156/312=0.5 gives us 50% MFU.

Practically:

  • with NVIDIA GPUs if you’re above 50% MFU on a multi-node setup with a large model, you’re already doing fantastic
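
The same arithmetic as a tiny Python sketch, using the numbers from the example above:

```python
# MFU calculation with the illustrative numbers from the example above
flop_per_iteration = 624e12   # 624 tera floating point operations per iteration
iteration_time_sec = 4
advertised_tflops = 312       # A100 BF16 peak

actual_tflops = flop_per_iteration / iteration_time_sec / 1e12   # 156.0
mfu = actual_tflops / advertised_tflops                          # 0.5
print(f"actual: {actual_tflops:.0f} TFLOPS, MFU: {mfu:.0%}")     # 156 TFLOPS, 50%
```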