TFLOPS

How many floating point operations the hardware can compute per second, in trillions (TFLOPS)

bf16 A100 80GB ~ 312 TFLOPS

Generally expect to get ~50% of the advertised TFLOPS in a real-world cluster

So a general rule of thumb when you prepare for a massive model training: ask around what's the top TFLOPS one can expect to get with a given accelerator on a multi-node setup at the specified precision, and optimize until you get close to that. Once you do, stop optimizing and start training.

Modern machine learning accelerators all have hardware specialized for matrix multiplication, such as Nvidia's "Tensor Cores". On an A100, if you aren't doing matrix multiplication, you'll only be able to achieve 19.5 TFLOPS instead of the stated 312.
Calculating TFLOPS
When calculating TFLOPS it's important to remember that the math is different if gradient checkpointing is enabled, since when it's activated more compute is used and it needs to be taken into account. Usually the cost is that of an additional forward pass, but recently better methods have been found that save some of that recomputation.
For decoder transformer models the following estimation formula slightly under-reports the real TFLOPS:

TFLOPS = model_size_in_B * 4 * 2 * seqlen * global_batch_size / (time_in_sec_per_iteration * total_gpus * 1e3)
The factor of 4 is used with activation/gradient checkpointing; otherwise it's 3. For 100B+ models, activation checkpointing will almost always be on.
The exact formula is in Equation 3 of Section 5.1 of Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM.
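The estimation formula above can be sketched as a small helper; the function and argument names are illustrative, mirroring the variables in the formula:

```python
def estimate_tflops(model_size_in_B, seqlen, global_batch_size,
                    time_in_sec_per_iteration, total_gpus,
                    activation_checkpointing=True):
    """Estimate achieved TFLOPS per GPU for a decoder transformer model.

    Uses the approximation:
        model_size_in_B * factor * 2 * seqlen * global_batch_size
        / (time_in_sec_per_iteration * total_gpus * 1e3)
    where factor is 4 with activation/gradient checkpointing, else 3.
    """
    factor = 4 if activation_checkpointing else 3
    return (model_size_in_B * factor * 2 * seqlen * global_batch_size
            / (time_in_sec_per_iteration * total_gpus * 1e3))

# e.g. a hypothetical 80B model, seqlen 2048, global batch size 1024,
# 30 secs/iteration on 512 GPUs, with activation checkpointing on:
print(estimate_tflops(80, 2048, 1024, 30, 512))  # ~87.4 TFLOPS per GPU
```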
Model Flops Utilization (MFU)
As mentioned in the previous section, some (most?) vendors publish unrealistic peak performance TFLOPS that aren't achievable in practice. Model FLOPs Utilization (MFU) is the metric that tells us how well the accelerator is actually utilized. Here is how it is calculated:
 Measure the actual TFLOPS by calculating how many floating point operations a single training iteration takes and dividing that number by the number of seconds this iteration took.
 Divide the actual TFLOPS by the advertised TFLOPS to get the MFU.
Example: Let's say you're training in BFLOAT16 precision:
• If a single iteration requires 624 Tera floating point operations and takes 4 secs to run, then we get 624/4=156 actual TFLOPS
• BF16@A100 is advertised as 312 TFLOPS, so 156/312=0.5 gives us 50% MFU

Practically:
• with NVIDIA GPUs, if you're above 50% MFU on a multi-node setup with a large model, you're already doing fantastic
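The two-step MFU calculation above can be written out directly, using the numbers from the example:

```python
def mfu(actual_tflops, advertised_tflops):
    """Model FLOPs Utilization: achieved TFLOPS / theoretical peak TFLOPS."""
    return actual_tflops / advertised_tflops

# Step 1: 624 Tera floating point ops in a 4-sec iteration -> actual TFLOPS
actual = 624 / 4                 # 156 actual TFLOPS
# Step 2: divide by the advertised peak (BF16@A100 = 312 TFLOPS)
print(mfu(actual, 312))          # 0.5, i.e. 50% MFU
```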