https://arxiv.org/pdf/2509.25149v1
Summary
Training methodology
In short, the recommendation for NVFP4 training is:
- Keep a small set of sensitive linear layers in higher precision (about 15% of the network, with most of the high-precision layers located at the end of the network).
- Apply Random Hadamard transforms of size 16×16 to inputs of weight gradient GEMMs.
- Use two-dimensional (2D) scaling over 16×16 blocks for weights, and one-dimensional (1D) scaling over 1×16 blocks for activations and gradients.
- Use stochastic rounding for gradients and round-to-nearest-even for weights and activations (both rounding modes, along with the block scaling and the Hadamard transform, are illustrated in the sketch below).
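To make the recipe more concrete, here is a minimal PyTorch sketch of two of its ingredients: a random-signed 16×16 Hadamard transform applied along the reduction dimension (as used for the Wgrad GEMM inputs), and NVFP4-style fake quantization with 1×16 block scales using either nearest or stochastic rounding. It is an illustration under simplifying assumptions, not the paper's implementation: the function names are hypothetical, scales are kept in FP32 (the actual NVFP4 format stores per-block scales in FP8 E4M3 together with a per-tensor FP32 scale), the 2D 16×16 scaling for weights is omitted, and FP32 inputs are assumed.

```python
import torch

# FP4 (E2M1) representable magnitudes; the largest finite value is 6.0.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_MAX = 6.0


def random_hadamard_transform(x: torch.Tensor, block: int = 16) -> torch.Tensor:
    """Apply a random-signed 16x16 Hadamard transform to groups of `block`
    elements along the last dimension (the Wgrad reduction dimension).
    Because H is orthonormal, applying the same transform to both Wgrad
    operands leaves the GEMM result unchanged while spreading outliers
    across each block before quantization."""
    H = torch.ones(1, 1, dtype=x.dtype)
    for _ in range(4):                          # Sylvester construction: 16 = 2**4
        H = torch.cat([torch.cat([H, H], dim=1), torch.cat([H, -H], dim=1)], dim=0)
    H = H / block ** 0.5                        # make H orthonormal
    signs = (torch.rand(block) < 0.5).to(x.dtype) * 2 - 1
    H = H * signs                               # random column sign flips
    return (x.reshape(*x.shape[:-1], -1, block) @ H).reshape(x.shape)


def fake_quantize_nvfp4(x: torch.Tensor, block: int = 16,
                        stochastic: bool = False) -> torch.Tensor:
    """Fake-quantize a 2-D FP32 tensor with 1x16 block scales (NVFP4-style).

    Each contiguous group of `block` elements shares one scale, chosen so the
    block's absmax maps to the largest FP4 magnitude. `stochastic=True` mimics
    the rounding used for gradients; the default nearest rounding is used for
    weights and activations (ties break toward the smaller magnitude here,
    whereas the recipe specifies ties-to-even)."""
    rows, cols = x.shape
    assert cols % block == 0
    xb = x.reshape(rows, cols // block, block)
    scale = xb.abs().amax(dim=-1, keepdim=True) / FP4_MAX
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    y = xb / scale                              # block values now lie in [-6, 6]

    grid = FP4_GRID.to(x.dtype)
    sign, mag = y.sign(), y.abs().clamp(max=FP4_MAX)
    lo_idx = torch.searchsorted(grid, mag.reshape(-1), right=True) - 1
    lo_idx = lo_idx.clamp(0, grid.numel() - 2)
    lo = grid[lo_idx].reshape(mag.shape)        # nearest grid point below
    hi = grid[lo_idx + 1].reshape(mag.shape)    # nearest grid point above
    if stochastic:
        p_hi = (mag - lo) / (hi - lo)           # unbiased stochastic rounding
        q = torch.where(torch.rand_like(mag) < p_hi, hi, lo)
    else:
        q = torch.where(mag - lo <= hi - mag, lo, hi)
    return (sign * q * scale).reshape(rows, cols)
```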
What’s kept in high precision
- The last layers are kept in high precision (see the selection sketch after this list).
- To ensure numerical stability during training, the following components retain their original precision (e.g., BF16 or FP32):
- embeddings,
- the output projection head,
- normalization layers and non-linearities,
- attention components, including softmax and the query-key and attention score-value batched GEMMs.
- The main weights (stored by the optimizer), weight gradients (used for gradient accumulation across microbatches and across data-parallel replicas), and optimizer states are also kept in FP32.
- Tensor parallel reductions are performed in BF16 precision.
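As one illustration of how such a selection could be wired up in a Transformer training stack, the hypothetical helper below walks a model's modules in definition order and keeps roughly the last 15% of its `nn.Linear` layers (plus everything that is not a Linear layer) out of FP4. The function name, the name-based split, and the exact fraction are assumptions for illustration, not the paper's code.

```python
import torch.nn as nn

def split_linears_for_fp4(model: nn.Module, keep_bf16_frac: float = 0.15):
    """Hypothetical selection rule: run earlier Linear layers through FP4 GEMMs
    and keep roughly the last `keep_bf16_frac` of them in BF16. Non-Linear
    modules (embeddings, norms, the output head, attention softmax/BMMs)
    are never selected for FP4 here."""
    names = [n for n, m in model.named_modules() if isinstance(m, nn.Linear)]
    cutoff = int(round(len(names) * (1.0 - keep_bf16_frac)))
    return set(names[:cutoff]), set(names[cutoff:])   # (fp4_layers, bf16_layers)
```

A quantization wrapper would then be installed only around the layers in the first returned set.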
Linear layer sensitivity analysis
- Although linear layers are typically computed in narrower precisions, the authors observe that some linear layers are more sensitive to FP4 than others.
- In particular, training diverges when every linear layer is quantized to FP4.
- Based on tensor analysis, they observe that the last layers tend to have larger quantization errors in the weight gradients (i.e., in the output of the Wgrad GEMM when its inputs are in FP4).
- Quantization error metrics could potentially serve as a mechanism to determine which linear layers should remain in higher precision during training; a sketch of one such metric follows below.
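Such a metric is cheap to compute offline. Below is a hedged sketch of one candidate, the relative L2 quantization error ||Q(x) − x|| / ||x||, which could be evaluated on each layer's Wgrad operands (for example, passing the `fake_quantize_nvfp4` sketch above as `quantize_fn`). The metric and any threshold are illustrative; the paper's exact criterion may differ.

```python
import torch

def relative_quant_error(x: torch.Tensor, quantize_fn) -> float:
    """Illustrative sensitivity metric: ||Q(x) - x||_2 / ||x||_2.

    Ranking layers by this value on their weight-gradient GEMM operands is one
    way to decide which linear layers to keep in higher precision."""
    x32 = x.float()
    err = torch.linalg.vector_norm(quantize_fn(x32) - x32)
    return (err / torch.linalg.vector_norm(x32).clamp_min(1e-12)).item()
```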
Summary of the NVFP4 quantized linear layer recipe
- GEMM operations consume FP4 tensors as inputs and produce outputs in BF16 or FP32.
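A numerics-only emulation of that contract might look like the sketch below. The function name and the generic `quantize_fn` argument are assumptions; real NVFP4 kernels consume packed FP4 data with FP8 block scales directly on Tensor Cores rather than dequantized FP32 operands.

```python
import torch

def fp4_linear_forward(x: torch.Tensor, w: torch.Tensor, quantize_fn) -> torch.Tensor:
    """Emulate the quantized GEMM contract: both operands are (fake-)quantized
    to FP4 blocks, the matmul accumulates in FP32, and the result is returned
    in BF16, matching the I/O precisions described above."""
    xq = quantize_fn(x.float())      # activations: 1x16 block scaling, nearest rounding
    wq = quantize_fn(w.float())      # weights: 2D 16x16 block scaling in the paper
    return (xq @ wq.t()).to(torch.bfloat16)
```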