Floating point representation
- The number of bits in the exponent determines the scale/range of numbers we can represent
- The number of bits in the mantissa determines the precision of numbers we can represent
- Because the value is split into an exponent and a fraction, the gaps between adjacent representable numbers are inevitably bigger for large numbers than for small ones (see the quick ulp check below)
- Representation of fp32
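A quick way to see the growing gaps, using only the Python standard library: math.ulp(x) returns the gap between x and the next representable double; fp32 and bf16 behave the same way, just with larger gaps.

```python
import math

# The gap to the next representable float grows with the magnitude of x.
for x in (1.0, 1024.0, 1e6, 1e15):
    print(f"{x:>10g}  gap to next double = {math.ulp(x):.3e}")
```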
Bit assignments, range, and granularity
- float32: 1 bit sign, 8 bits exponent, 23 bits mantissa; range: ~1.18e-38 to ~3.40e38
- bfloat16: 1 bit sign, 8 bits exponent, 7 bits mantissa; range: ~1.18e-38 to ~3.39e38 (more range, less granular)
- FP16: 1 bit sign, 5 bits exponent, 10 bits mantissa; range: ~6.10e-5 to 65504 (less range, more granular)
- FP8 (E4M3): 1 bit sign, 4 bits exponent, 3 bits mantissa; range: ~1.56e-2 to 448
- WARNING: E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit pattern for NaN; the greater range is much more useful than supporting multiple encodings for the special values
- FP8 (E5M2): 1 bit sign, 5 bits exponent, 2 bits mantissa; range: ~6.10e-5 to 57344
- Converting from fp32 to bfloat16 is easy: the exponent is kept the same and the significand is rounded or truncated from 24 bits to 8; hence overflow and underflow are not possible in the conversion
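A minimal check of the numbers above with PyTorch's torch.finfo (recent versions also expose torch.float8_e4m3fn and torch.float8_e5m2, not shown here):

```python
import torch

# max, smallest positive normal, and eps (granularity around 1.0) for each dtype
for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} max={info.max:.3e}  min_normal={info.tiny:.3e}  eps={info.eps:.3e}")

# fp32 -> bf16 keeps the 8-bit exponent, so a huge fp32 value stays finite in bf16,
# while converting the same value to fp16 overflows to inf.
x = torch.tensor(3e38, dtype=torch.float32)
print(x.to(torch.bfloat16))  # finite, ~3e38
print(x.to(torch.float16))   # inf
```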
BF16 bad precision example
- In bf16, 256 + 1 = 256: the significand has only 8 bits, so 257 is not representable and the sum rounds back to 256
- Because of this lack of precision, accumulating many small additions in bf16 can produce large rounding errors (e.g. a running sum stalls), which wouldn't happen in fp16 or fp32
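A small sketch of this effect in plain PyTorch (no GPU needed): a running sum of 1.0 stalls at 256 in bf16, while fp16 and fp32 reach 1000 exactly.

```python
import torch

print(torch.tensor(256.0, dtype=torch.bfloat16) + 1.0)  # still 256 in bf16

# Accumulate 1.0 a thousand times in each dtype.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    total = torch.tensor(0.0, dtype=dtype)
    for _ in range(1000):
        total = total + 1.0
    print(dtype, total.item())  # fp32/fp16: 1000.0, bf16: 256.0
```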
Encoding
- s = sign bit, e = exponent bits, m = fraction or mantissa bits
- Value = (-1)^s × 1.m × 2^(e - bias) (in most cases; special values apply, e.g. zero and infinity)
- exponent bias = 2^(k-1) - 1, where k is the number of exponent bits
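Plugging a concrete (hypothetical) fp16 bit pattern into the formula:

```python
# fp16 bit pattern 0 01111 1000000000 (sign, exponent, mantissa fields)
s, e, m = 0, 0b01111, 0b1000000000
bias = 2 ** (5 - 1) - 1                                  # = 15 for fp16
value = (-1) ** s * (1 + m / 2 ** 10) * 2 ** (e - bias)
print(value)                                             # 1.5
```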
Exponent encoding
- To adequately represent values below 1, the exponent is encoded using an offset-binary representation, with the offset (also called the exponent bias) equal to 2^(k-1) - 1, the middle of the representable range: 127 for fp32 and 15 for fp16
- For fp16:
- Exponent bias = 2^(5-1) - 1 = 15
- The all-zeros (00000) and all-ones (11111) exponent fields are special cases (zero/subnormals and infinity/NaN, respectively)
- When the exponent field is all zeros, we go into the SUBNORMAL regime and the equation changes (no implicit leading 1, exponent fixed at 1 - bias = -14); the smallest positive value in this regime is 2^-24 ≈ 5.96e-8
- Thus, the minimum positive normal value is 2^-14 ≈ 6.10e-5
- The maximum positive normal value (excluding infinity) is (2 - 2^-10) × 2^15 = 65504
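A small decoder that mirrors these rules, using only the standard library (struct's "e" format packs a Python float as fp16):

```python
import struct

def decode_fp16(x: float):
    """Split an fp16 value into its sign/exponent/mantissa fields and
    re-evaluate it with value = (-1)^s * 1.m * 2^(e - bias)."""
    (bits,) = struct.unpack("<H", struct.pack("<e", x))
    s = (bits >> 15) & 0x1
    e = (bits >> 10) & 0x1F            # 5 exponent bits
    m = bits & 0x3FF                   # 10 mantissa (fraction) bits
    bias = 2 ** (5 - 1) - 1            # = 15
    if e == 0:                         # all zeros: subnormal, no implicit leading 1
        value = (-1) ** s * (m / 2 ** 10) * 2 ** (1 - bias)
    elif e == 0x1F:                    # all ones: infinity or NaN
        value = float("nan") if m else (-1) ** s * float("inf")
    else:                              # normal: implicit leading 1
        value = (-1) ** s * (1 + m / 2 ** 10) * 2 ** (e - bias)
    return s, e, m, value

print(decode_fp16(65504.0))   # (0, 30, 1023, 65504.0)         max normal
print(decode_fp16(2 ** -14))  # (0, 1, 0, 6.103515625e-05)     min normal
print(decode_fp16(2 ** -24))  # (0, 0, 1, ~5.96e-08)           min subnormal
```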
Gradient scaling (fp16)
- When training in fp16, the maximum normal value is 65,504 and the minimum normal value is 2^-14 ≈ 6.10e-5. We need to prevent underflow!
- AMP/fp16 may not work for every model! For example, most bf16-pretrained models cannot operate in the fp16 numerical range (max 65,504) and will cause gradients to overflow instead of underflow. In this case, the scale factor may decrease below 1 in an attempt to bring gradients into the fp16 dynamic range. While one may expect the scale to always be above 1, PyTorch's GradScaler does NOT make this guarantee, to maintain performance (see the AMP sketch after the procedure below)
- Loss scaling shifts gradient values into the representable range of fp16; the standard mixed-precision training procedure is:
- Maintain a primary copy of weights in FP32.
- For each iteration:
- Make an FP16 copy of the weights.
- Forward propagation (FP16 weights and activations).
- Multiply the resulting loss with the scaling factor S.
- Backward propagation (FP16 weights, activations, and their gradients).
- Multiply the weight gradient with 1/S.
- Complete the weight update (including gradient clipping, etc.).
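A minimal sketch of these steps with a static loss scale S (a CUDA device is assumed, since fp16 kernels target GPUs; the model, data, and S below are illustrative placeholders):

```python
import copy
import torch

device = "cuda"                                       # fp16 compute assumes a GPU
model = torch.nn.Linear(16, 1).to(device)             # FP32 primary copy of the weights
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
S = 1024.0                                            # static loss-scaling factor

data = torch.randn(8, 16, device=device)
target = torch.randn(8, 1, device=device)

for _ in range(10):
    model_fp16 = copy.deepcopy(model).half()                                      # FP16 copy of the weights
    loss = torch.nn.functional.mse_loss(model_fp16(data.half()), target.half())   # FP16 forward pass
    (loss * S).backward()                                                         # scale the loss, FP16 backward pass
    for p32, p16 in zip(model.parameters(), model_fp16.parameters()):
        p32.grad = p16.grad.float() / S                                           # copy grads to FP32 and unscale by 1/S
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)                       # clipping before the update
    opt.step()                                                                    # update the FP32 primary weights
    opt.zero_grad()
```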
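The dynamic version provided by PyTorch AMP: GradScaler grows and shrinks the scale automatically and, as noted above, is allowed to drop below 1 (model and data are placeholders, CUDA device assumed).

```python
import torch

model = torch.nn.Linear(16, 1).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 16, device="cuda")
y = torch.randn(8, 1, device="cuda")

for _ in range(10):
    opt.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()   # multiply the loss by the current scale
    scaler.step(opt)                # unscale grads; skip the step if inf/NaN is found
    scaler.update()                 # grow/shrink the scale for the next iteration
    print(scaler.get_scale())       # not guaranteed to stay above 1
```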