Rounding

Whenever a low-precision dtype is used, one has to be careful not to accumulate intermediate results in that dtype.
LayerNorm-like operations must not do their reductions (mean, variance) in half precision, or they may lose a lot of precision.
Generally it’s just the accumulation that is done in fp32, since adding up many low-precision numbers is otherwise very lossy: once the running sum is large, small addends are partly or wholly rounded away.
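
To see why the accumulator dtype matters, here is a minimal sketch using only the Python standard library. The helpers `to_fp16`, `to_fp32` and `accumulate` are hypothetical names, not from any library; they emulate storing a value in half or single precision by round-tripping it through `struct`'s `'e'` (binary16) and `'f'` (binary32) formats. Summing 4096 ones with a half-precision accumulator stalls at 2048, because fp16 cannot represent 2049 and the running total rounds back down every time.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision value (hypothetical helper)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def accumulate(values, round_fn):
    """Sum values, rounding the running total with round_fn after every add."""
    total = 0.0
    for v in values:
        total = round_fn(total + v)
    return total

ones = [1.0] * 4096  # every input is exactly representable in fp16

print(accumulate(ones, to_fp16))  # 2048.0: the fp16 accumulator gets stuck
print(accumulate(ones, to_fp32))  # 4096.0: the fp32 accumulator is exact here
```

The inputs are identical in both runs; only the dtype of the running total changes.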
The default IEEE 754 rounding mode is round to nearest, ties to even: a value exactly halfway between two representable numbers rounds to the one whose last retained digit is even. This is by far the most common mode.
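
A small illustration of ties-to-even, assuming half precision emulated through Python's `struct` module (`to_fp16` is a hypothetical helper, not a library function). Between 2048 and 4096 the fp16 values are spaced 2 apart, so every odd integer in that range is an exact tie.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Each input sits exactly halfway between two neighbouring fp16 values.
ties = [2049.0, 2051.0, 2053.0, 2055.0]
rounded = [to_fp16(x) for x in ties]

for x, r in zip(ties, rounded):
    print(x, "->", r)
# 2049.0 -> 2048.0
# 2051.0 -> 2052.0
# 2053.0 -> 2052.0
# 2055.0 -> 2056.0
```

Ties alternate between rounding down and rounding up, so the rounding errors do not all push in the same direction.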

Floating-point addition proceeds in four steps:

1. Rewrite the smaller number so that its exponent matches the exponent of the larger number.
   1. This means that for $a + b$ where $a$ is very small (e.g. $1.2 \times 2^{-10}$) and $b$ is very large (e.g. $1.42 \times 2^{15}$), $a$ must be shifted right by a lot (25 bits!). That can exceed the available precision (fp32 has a 24-bit significand), in which case $a$ is rounded off entirely and the addition is equivalent to not adding anything.
2. Add the mantissas.
3. Put the result in normalised form.
4. Round the result.
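
The addition steps above can be sketched with exact integer arithmetic. This is a toy model under stated assumptions: `toy_add` and `to_fp32` are hypothetical names, the inputs are positive and already representable with `p` significant bits, and overflow and subnormals are ignored. Instead of discarding the bits shifted out during alignment, the sketch keeps all of them so that the final rounding is exact; real hardware gets the same result with only a few extra guard bits.

```python
import math
import struct

def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision value (hypothetical helper)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def toy_add(a: float, b: float, p: int = 24):
    """Add positive a and b in a toy p-bit format; returns (sum, alignment_shift)."""
    def decompose(x):
        # x = m * 2**e with an integer mantissa m in [2**(p-1), 2**p)
        f, e = math.frexp(x)  # x = f * 2**e with 0.5 <= f < 1
        return int(f * (1 << p)), e - p

    (m_small, e_small), (m_big, e_big) = sorted(
        (decompose(a), decompose(b)), key=lambda me: me[1])

    # Step 1: align exponents. Shifting the small mantissa right by `shift`
    # is equivalent to shifting the big one left, which loses no bits.
    shift = e_big - e_small
    # Step 2: add the mantissas (both are now scaled by 2**e_small).
    total = (m_big << shift) + m_small
    # Step 3: normalise back to p bits.
    excess = total.bit_length() - p  # always >= 1 for two positive inputs
    q, r = divmod(total, 1 << excess)
    # Step 4: round to nearest, ties to even.
    half = 1 << (excess - 1)
    if r > half or (r == half and q % 2 == 1):
        q += 1
    return math.ldexp(q, e_small + excess), shift

a = to_fp32(1.2 * 2**-10)
b = to_fp32(1.42 * 2**15)
result, shift = toy_add(a, b)
print(shift)        # 25
print(result == b)  # True: a was rounded away entirely
```

With `p = 24` (the fp32 significand width), the 25-bit shift pushes every bit of `a` below half of the last bit of `b`, so the sum rounds back to `b`.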

🤖 Harold's Notes