https://arxiv.org/pdf/2509.25149v1
Summary
Training methodology
In short, the recommendation for NVFP4 training is:
- Keep a small set of sensitive linear layers in higher precision (about 15% of the network, with most of the high-precision layers located at the end of the network).
- Apply Random Hadamard transforms of size 16×16 to inputs of weight gradient GEMMs.
- Use two-dimensional (2D) scaling over 16×16 blocks for weights, and one-dimensional (1D) scaling over 1×16 blocks for activations and gradients.
- Use stochastic rounding for gradients and round-to-nearest-even for weights and activations (both rounding modes, along with the block scaling and the Hadamard transform, are illustrated in the sketch below).
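To make the recipe more concrete, here is a minimal PyTorch sketch of two of its ingredients: a random-signed 16×16 Hadamard transform applied along the reduction dimension (as used for the Wgrad GEMM inputs), and NVFP4-style fake quantization with 1×16 block scales using either nearest or stochastic rounding. It is an illustration under simplifying assumptions, not the paper's implementation: the function names are hypothetical, scales are kept in FP32 (the actual NVFP4 format stores per-block scales in FP8 E4M3 together with a per-tensor FP32 scale), the 2D 16×16 scaling for weights is omitted, and FP32 inputs are assumed.

```python
import torch

# FP4 (E2M1) representable magnitudes; the largest finite value is 6.0.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_MAX = 6.0


def random_hadamard_transform(x: torch.Tensor, block: int = 16) -> torch.Tensor:
    """Apply a random-signed 16x16 Hadamard transform to groups of `block`
    elements along the last dimension (the Wgrad reduction dimension).
    Because H is orthonormal, applying the same transform to both Wgrad
    operands leaves the GEMM result unchanged while spreading outliers
    across each block before quantization."""
    H = torch.ones(1, 1, dtype=x.dtype)
    for _ in range(4):                          # Sylvester construction: 16 = 2**4
        H = torch.cat([torch.cat([H, H], dim=1), torch.cat([H, -H], dim=1)], dim=0)
    H = H / block ** 0.5                        # make H orthonormal
    signs = (torch.rand(block) < 0.5).to(x.dtype) * 2 - 1
    H = H * signs                               # random column sign flips
    return (x.reshape(*x.shape[:-1], -1, block) @ H).reshape(x.shape)


def fake_quantize_nvfp4(x: torch.Tensor, block: int = 16,
                        stochastic: bool = False) -> torch.Tensor:
    """Fake-quantize a 2-D FP32 tensor with 1x16 block scales (NVFP4-style).

    Each contiguous group of `block` elements shares one scale, chosen so the
    block's absmax maps to the largest FP4 magnitude. `stochastic=True` mimics
    the rounding used for gradients; the default nearest rounding is used for
    weights and activations (ties break toward the smaller magnitude here,
    whereas the recipe specifies ties-to-even)."""
    rows, cols = x.shape
    assert cols % block == 0
    xb = x.reshape(rows, cols // block, block)
    scale = xb.abs().amax(dim=-1, keepdim=True) / FP4_MAX
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    y = xb / scale                              # block values now lie in [-6, 6]

    grid = FP4_GRID.to(x.dtype)
    sign, mag = y.sign(), y.abs().clamp(max=FP4_MAX)
    lo_idx = torch.searchsorted(grid, mag.reshape(-1), right=True) - 1
    lo_idx = lo_idx.clamp(0, grid.numel() - 2)
    lo = grid[lo_idx].reshape(mag.shape)        # nearest grid point below
    hi = grid[lo_idx + 1].reshape(mag.shape)    # nearest grid point above
    if stochastic:
        p_hi = (mag - lo) / (hi - lo)           # unbiased stochastic rounding
        q = torch.where(torch.rand_like(mag) < p_hi, hi, lo)
    else:
        q = torch.where(mag - lo <= hi - mag, lo, hi)
    return (sign * q * scale).reshape(rows, cols)
```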
What’s kept in high precision
- The last layers are kept in high precision (see the selection sketch after this list).
- To ensure numerical stability during training, the following components retain their original precision (e.g., BF16 or FP32):
- embeddings,
- the output projection head,
- normalization layers and non-linearities,
- attention components, including softmax and the query-key and attention score-value batched GEMMs.
- The main weights (stored by the optimizer), weight gradients (used for gradient accumulation across microbatches and across data-parallel replicas), and optimizer states are also kept in FP32.
- Tensor parallel reductions are performed in BF16 precision.
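As one illustration of how such a selection could be wired up in a Transformer training stack, the hypothetical helper below walks a model's modules in definition order and keeps roughly the last 15% of its `nn.Linear` layers (plus everything that is not a Linear layer) out of FP4. The function name, the name-based split, and the exact fraction are assumptions for illustration, not the paper's code.

```python
import torch.nn as nn

def split_linears_for_fp4(model: nn.Module, keep_bf16_frac: float = 0.15):
    """Hypothetical selection rule: run earlier Linear layers through FP4 GEMMs
    and keep roughly the last `keep_bf16_frac` of them in BF16. Non-Linear
    modules (embeddings, norms, the output head, attention softmax/BMMs)
    are never selected for FP4 here."""
    names = [n for n, m in model.named_modules() if isinstance(m, nn.Linear)]
    cutoff = int(round(len(names) * (1.0 - keep_bf16_frac)))
    return set(names[:cutoff]), set(names[cutoff:])   # (fp4_layers, bf16_layers)
```

A quantization wrapper would then be installed only around the layers in the first returned set.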
Linear layer sensitivity analysis
- Although linear layers are typically computed in narrower precisions, the authors observe that some linear layers are more sensitive to FP4 than others.
- In particular, training diverges when every linear layer is quantized to FP4.
- Based on tensor analysis, they observe that the last layers tend to have larger quantization errors in the weight gradients (i.e., in the output of the Wgrad GEMM when its inputs are in FP4).
- Quantization error metrics could potentially serve as a mechanism to determine which linear layers should remain in higher precision during training; a sketch of one such metric follows below.
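Such a metric is cheap to compute offline. Below is a hedged sketch of one candidate, the relative L2 quantization error ||Q(x) − x|| / ||x||, which could be evaluated on each layer's Wgrad operands (for example, passing the `fake_quantize_nvfp4` sketch above as `quantize_fn`). The metric and any threshold are illustrative; the paper's exact criterion may differ.

```python
import torch

def relative_quant_error(x: torch.Tensor, quantize_fn) -> float:
    """Illustrative sensitivity metric: ||Q(x) - x||_2 / ||x||_2.

    Ranking layers by this value on their weight-gradient GEMM operands is one
    way to decide which linear layers to keep in higher precision."""
    x32 = x.float()
    err = torch.linalg.vector_norm(quantize_fn(x32) - x32)
    return (err / torch.linalg.vector_norm(x32).clamp_min(1e-12)).item()
```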
Summary of the NVFP4 quantized linear layer recipe
- GEMM operations consume FP4 tensors as inputs and produce outputs in BF16 or FP32.
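A numerics-only emulation of that contract might look like the sketch below. The function name and the generic `quantize_fn` argument are assumptions; real NVFP4 kernels consume packed FP4 data with FP8 block scales directly on Tensor Cores rather than dequantized FP32 operands.

```python
import torch

def fp4_linear_forward(x: torch.Tensor, w: torch.Tensor, quantize_fn) -> torch.Tensor:
    """Emulate the quantized GEMM contract: both operands are (fake-)quantized
    to FP4 blocks, the matmul accumulates in FP32, and the result is returned
    in BF16, matching the I/O precisions described above."""
    xq = quantize_fn(x.float())      # activations: 1x16 block scaling, nearest rounding
    wq = quantize_fn(w.float())      # weights: 2D 16x16 block scaling in the paper
    return (xq @ wq.t()).to(torch.bfloat16)
```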