- Whenever a low-precision dtype is used, one has to be careful not to accumulate intermediate results in that dtype.
- LayerNorm-like operations must not do their work in half precision, or they may lose a lot of precision.
- Generally it's just the accumulation that is done in fp32, since adding up many low-precision numbers is otherwise very lossy.
- Default rounding mode for IEEE 754: round to nearest, where ties round to the nearest even digit in the required position (by far the most common mode).
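
A minimal sketch of why the accumulator dtype matters, using only the Python standard library. It assumes that round-tripping through `struct` formats `"e"` (binary16) and `"f"` (binary32) is an acceptable stand-in for real fp16/fp32 hardware arithmetic; the helper names `round_to` and `accumulate` are hypothetical and only for illustration.

```python
import struct

def round_to(fmt, x):
    """Round a Python float to the nearest value representable in a struct
    float format: "e" is IEEE binary16 (fp16), "f" is IEEE binary32 (fp32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def accumulate(values, acc_fmt):
    """Sum fp16 inputs, rounding the running total to acc_fmt after every add."""
    acc = 0.0
    for v in values:
        acc = round_to(acc_fmt, acc + round_to("e", v))
    return acc

ones = [1.0] * 4096
fp16_sum = accumulate(ones, "e")  # accumulator kept in fp16
fp32_sum = accumulate(ones, "f")  # accumulator kept in fp32

print(fp16_sum)  # 2048.0
print(fp32_sum)  # 4096.0

# Above 2048 the spacing between fp16 values is 2, so odd integers are ties.
# Ties go to the neighbour with an even significand.
print(round_to("e", 2049.0), round_to("e", 2051.0))  # 2048.0 2052.0
```

With an fp16 accumulator the sum stalls at 2048: every further `2048 + 1` is a tie that rounds back to 2048. The fp32 accumulator reaches 4096, which is itself exactly representable in fp16, so only the accumulation needed the wider dtype.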
Summation
- Rewrite the smaller number so that its exponent matches the exponent of the larger number.
- This means that when a is very small and b is very large, the mantissa of a must be shifted right by many bits (e.g. 25 bits), which may push it outside the available precision; those bits are then rounded off, which is equivalent to not adding anything at all.
- Add the mantissas
- Put the result in Normalised Form
- Round the result
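
The steps above can be sketched with plain integers. This is an illustrative toy, not how any particular hardware or library implements addition: it handles positive values only, the names `fp_add` and `round_ties_even` are hypothetical, and the operand values (1.0 and 2^-25 with a 24-bit fp32-style significand) are chosen here for the example. To keep rounding exact it shifts the larger significand left instead of shifting the smaller one right, which gives the same rounded result as aligning to the larger exponent.

```python
def round_ties_even(m, drop):
    """Drop the lowest `drop` bits of m, rounding to nearest, ties to even."""
    if drop <= 0:
        return m << -drop
    kept = m >> drop
    lost = m & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    if lost > half or (lost == half and kept & 1):
        kept += 1
    return kept

def fp_add(ma, ea, mb, eb, p=24):
    """Add ma * 2**ea and mb * 2**eb, returning (m, e) with a p-bit significand."""
    if ea > eb:                        # make (mb, eb) the larger-exponent operand
        ma, ea, mb, eb = mb, eb, ma, ea
    shift = eb - ea                    # align the exponents
    total = (mb << shift) + ma         # add the significands exactly
    drop = total.bit_length() - p      # normalise back to p bits
    m = round_ties_even(total, drop)   # round the result
    e = ea + drop
    if m.bit_length() > p:             # rounding carried into an extra bit
        m >>= 1
        e += 1
    return m, e

one = (1 << 23, -23)                   # 1.0 as a 24-bit significand
tiny = (1 << 23, -48)                  # 2**-25, exponent differs by 25
print(fp_add(*tiny, *one))             # (8388608, -23), i.e. still 1.0
print(fp_add(1 << 23, -46, *one))      # (8388609, -23), i.e. 1.0 + 2**-23
```

In the first call the smaller operand sits 25 bits below the larger one, so after normalising and rounding nothing of it survives. In the second call the exponent gap is 23 bits, so the addend lands on the last significand bit and is kept.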