https://arxiv.org/pdf/2509.25149v1

Summary

Training methodology

In short, the recommendation for NVFP4 training is:

  1. Keep a few sensitive linear layers in higher precision (roughly 15% of the network's linear layers, with most of the high-precision layers at the end of the network).
  2. Apply Random Hadamard transforms of size 16×16 to inputs of weight gradient GEMMs.
  3. Use two-dimensional (2D) scaling over 16×16 blocks for weights, and one-dimensional (1D) scaling over 1×16 blocks for activations and gradients.
  4. Use stochastic rounding for gradients and round-to-nearest-even for weights and activations (a sketch of both rounding modes follows this list).
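
To make items 3 and 4 concrete, here is a minimal "fake quantization" sketch in PyTorch. It simulates 1×16 block scaling by snapping values to the FP4 (E2M1) magnitude grid {0, 0.5, 1, 1.5, 2, 3, 4, 6} and dequantizing back, with either nearest or stochastic rounding. The function name and the choice of mapping each block's maximum magnitude to 6 are assumptions made for illustration, not the paper's implementation; the real format's encoding of block scales (and any second-level per-tensor scale) is omitted.

```python
import torch

# Representable FP4 (E2M1) magnitudes.
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_blocks(x: torch.Tensor, block: int = 16,
                        stochastic: bool = False) -> torch.Tensor:
    """Quantize-dequantize x with 1x`block` scaling (assumes numel % block == 0)."""
    xb = x.reshape(-1, block).float()                   # one row per scaling block
    scale = xb.abs().amax(dim=1, keepdim=True) / 6.0    # map block max to grid max (6)
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    xs = xb / scale                                     # magnitudes now in [0, 6]

    grid = E2M1_GRID.to(xs.device)
    # Largest grid point at or below |xs| (grid is sorted ascending).
    lo_idx = (torch.bucketize(xs.abs(), grid, right=True) - 1).clamp(0, len(grid) - 2)
    lo, hi = grid[lo_idx], grid[lo_idx + 1]
    if stochastic:
        # Stochastic rounding: round up with probability proportional to the
        # distance from the lower grid point (unbiased in expectation).
        p_up = (xs.abs() - lo) / (hi - lo)
        q = torch.where(torch.rand_like(xs) < p_up, hi, lo)
    else:
        # Nearest rounding (ties go up here; true RNE breaks ties to even).
        q = torch.where(xs.abs() - lo < hi - xs.abs(), lo, hi)
    return (q * xs.sign() * scale).reshape(x.shape).to(x.dtype)

# Per the recipe: nearest rounding for weights/activations, stochastic for gradients.
w_q = quantize_fp4_blocks(torch.randn(128, 256), stochastic=False)
g_q = quantize_fp4_blocks(torch.randn(128, 256), stochastic=True)
```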

What’s kept in high precision

  • The last few layers of the network are kept in high precision.
  • To ensure numerical stability during training, they retain the original precision (e.g., BF16 or FP32) for
    • embeddings,
    • the output projection head,
    • normalization layers and non-linearities,
    • attention components, including softmax and the query-key and attention score-value batched GEMMs.
  • The main weights (stored by the optimizer), weight gradients (used for gradient accumulation across microbatches and across data-parallel replicas), and optimizer states are also kept in FP32.
  • Tensor parallel reductions are performed in BF16 precision.
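
Taken together, these bullets amount to a per-module precision policy. Below is a minimal sketch of such a policy; the name patterns, the helper signature, and the 15% tail threshold are illustrative assumptions, not the paper's configuration.

```python
# Sketch of a per-module precision policy matching the list above. Softmax,
# non-linearities, and the QK^T / score-V batched matmuls are not linear layers,
# so they never reach the FP4 path; master weights, accumulated weight gradients,
# and optimizer states would separately stay in FP32.
HIGH_PRECISION_NAME_PATTERNS = ("embed", "lm_head", "norm")  # assumed module names

def uses_fp4_gemm(module_name: str, is_linear: bool,
                  layer_idx: int, num_layers: int,
                  high_precision_tail: float = 0.15) -> bool:
    """Return True if this module's GEMMs should run in NVFP4."""
    if not is_linear:
        return False  # only linear-layer GEMMs are candidates for FP4
    if any(p in module_name for p in HIGH_PRECISION_NAME_PATTERNS):
        return False  # embeddings, output head, norms stay in BF16/FP32
    # Keep linear layers in roughly the last 15% of blocks in higher precision.
    if layer_idx >= int(num_layers * (1.0 - high_precision_tail)):
        return False
    return True

# Example: in a 32-block network, the last ~15% of blocks (indices 27-31 here)
# keep their linear layers in higher precision.
assert uses_fp4_gemm("blocks.5.mlp.fc1", True, 5, 32)
assert not uses_fp4_gemm("blocks.30.mlp.fc1", True, 30, 32)
```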

Linear layer sensitivity analysis

  • Although linear layers are typically computed in narrower precisions, the authors observe that some linear layers are more sensitive to FP4 than others.

  • In particular, training diverges when every linear layer is quantized to FP4.

  • Based on tensor-level analysis, they observe that the last layers tend to have larger quantization errors in the weight gradients (i.e., in the output of the Wgrad GEMM when its inputs are in FP4).

  • Quantization error metrics could potentially serve as a mechanism to determine which linear layers should remain in higher precision during training.
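
A hedged sketch of one such metric: fake-quantize each layer's weight-gradient tensor to the FP4 grid with 1×16 block scaling and rank layers by the relative error introduced. The quantizer below is a crude nearest-rounding stand-in and the names are assumed; the paper's exact metric and tooling are not shown here.

```python
import torch

E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_fp4(x: torch.Tensor, block: int = 16) -> torch.Tensor:
    """Crude FP4 quantize-dequantize: 1x`block` absmax scaling, nearest rounding."""
    xb = x.reshape(-1, block).float()
    scale = xb.abs().amax(dim=1, keepdim=True) / 6.0
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    xs = xb / scale
    # Snap each magnitude to the nearest E2M1 grid point, keep the sign.
    nearest = E2M1_GRID[(xs.abs().unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)]
    return (nearest * xs.sign() * scale).reshape(x.shape).to(x.dtype)

def relative_quantization_error(t: torch.Tensor) -> float:
    """||Q(t) - t|| / ||t||: a simple proxy for how much a tensor suffers in FP4."""
    t = t.float()
    return ((fake_quantize_fp4(t) - t).norm() / t.norm().clamp_min(1e-12)).item()

# Example: rank layers by the error their weight gradients would incur in FP4;
# layers with the largest errors are candidates to stay in higher precision.
wgrads = {"blocks.0.mlp.fc1": torch.randn(512, 512),
          "blocks.31.mlp.fc1": torch.randn(512, 512)}
ranking = sorted(wgrads, key=lambda k: relative_quantization_error(wgrads[k]),
                 reverse=True)
```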

Summary of the NVFP4 quantized linear layer recipe

  • GEMM operations consume FP4 tensors as inputs and produce outputs in BF16 or FP32.
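
Below is a hedged sketch of this pattern for the weight-gradient GEMM, also folding in the 16×16 random Hadamard transform from item 2 of the recipe: both operands are rotated along the contracted (token) dimension, fake-quantized with 1×16 blocks along that dimension, multiplied with FP32 accumulation, and the result is cast to BF16. The helper names, the placement of the transform, and the simulated quantize-dequantize (rather than true FP4 operands with hardware block scales) are illustrative assumptions.

```python
import torch

E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_fp4(x: torch.Tensor, block: int = 16) -> torch.Tensor:
    """Same crude 1x16 absmax quantize-dequantize as in the previous sketch."""
    xb = x.reshape(-1, block).float()
    scale = xb.abs().amax(dim=1, keepdim=True) / 6.0
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    xs = xb / scale
    nearest = E2M1_GRID[(xs.abs().unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)]
    return (nearest * xs.sign() * scale).reshape(x.shape)

def random_hadamard_16(seed: int = 0) -> torch.Tensor:
    """Orthonormal 16x16 Hadamard matrix with random column sign flips."""
    h = torch.tensor([[1.0]])
    for _ in range(4):                                  # Sylvester construction: 1 -> 16
        h = torch.cat([torch.cat([h, h], 1), torch.cat([h, -h], 1)], 0)
    g = torch.Generator().manual_seed(seed)
    signs = torch.randint(0, 2, (16,), generator=g).float() * 2 - 1
    return (h * signs) / 4.0                            # 1/sqrt(16) keeps it orthonormal

def wgrad_gemm_fp4(grad_out: torch.Tensor, inp: torch.Tensor) -> torch.Tensor:
    """Weight gradient ~ grad_out^T @ inp with RHT + simulated FP4 on both inputs."""
    tokens = inp.shape[0]                               # contracted dim, multiple of 16
    H = torch.block_diag(*([random_hadamard_16()] * (tokens // 16)))
    # Rotating both operands along the contracted dimension is mathematically a
    # no-op (H is orthogonal) but spreads outliers before quantization.
    a = fake_quantize_fp4((H.T @ grad_out.float()).T)   # (out, tokens), blocks along tokens
    b = fake_quantize_fp4((H.T @ inp.float()).T)        # (in, tokens), blocks along tokens
    return (a @ b.T).to(torch.bfloat16)                 # FP32 accumulation, BF16 output

wgrad = wgrad_gemm_fp4(torch.randn(256, 1024), torch.randn(256, 512))  # (1024, 512)
```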