instability was observed for large models. It was caused by extremely large values in the attention logits $z_{ij} = \langle q_i, k_j \rangle / \sqrt{d_h}$, which lead to (almost one-hot) attention weights with near-zero entropy.
The mitigation is qk-layernorm, which applies LayerNorm to the queries and keys before the attention logits are computed; models with qk-layernorm exhibit considerably lower LR sensitivity and train to low loss at high learning rates.
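As a rough sketch of this mitigation (not the paper's implementation; the class and argument names below are illustrative), the PyTorch module applies LayerNorm to the queries and keys over the per-head dimension before the dot product, which bounds the attention logits $z_{ij}$:

```python
import math

import torch
import torch.nn as nn


class QKLayerNormAttention(nn.Module):
    """Multi-head self-attention with LayerNorm on queries and keys (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        # qk-layernorm: normalize queries and keys over the per-head dimension
        # before the dot product, bounding the attention logits z_ij.
        self.q_norm = nn.LayerNorm(self.d_head)
        self.k_norm = nn.LayerNorm(self.d_head)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, time, d_head).
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)
        # z_ij = <q_i, k_j> / sqrt(d_head); causal masking omitted for brevity.
        z = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        w = z.softmax(dim=-1)
        y = (w @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(y)
```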
z-loss
Another instability, observed when training large models, is divergence of the output logits from the log probabilities [6]. Let $y$ denote the model's output logits, which are used to compute class probabilities $p_i$ via a softmax $p_i = e^{y_i}/Z$, where $Z = \sum_j e^{y_j}$.
This instability occurs when the logits diverge and become very negative.
In contrast to the attention logit growth instability, this divergence occurs towards the end of training.
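One way to see why the logits are free to drift is that the softmax is invariant to adding a constant to every logit, so the cross-entropy loss by itself places no constraint on $\log Z$. The short sketch below, with made-up logit values, illustrates this:

```python
import torch

# The softmax is invariant to adding a constant to every logit, so the
# cross-entropy loss alone does not pin down log Z; the logits can drift
# far negative while the predicted probabilities stay exactly the same.
y = torch.tensor([2.0, -1.0, 0.5])        # example output logits (made up)
y_shifted = y - 50.0                      # same logits, uniformly shifted

print(torch.softmax(y, dim=-1))           # tensor([0.7856, 0.0391, 0.1753])
print(torch.softmax(y_shifted, dim=-1))   # identical probabilities
print(torch.logsumexp(y, dim=-1))         # log Z ≈ 2.24
print(torch.logsumexp(y_shifted, dim=-1)) # log Z ≈ -47.76, far from zero
```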
The mitigation proposed by Chowdhery et al. [6] is to encourage $\log Z$ to remain close to zero, i.e., $Z \approx 1$.
To do so, they add an auxiliary loss $\log^2 Z$, referred to as z-loss, with a coefficient of $10^{-4}$.
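A minimal sketch of this mitigation, assuming a standard cross-entropy objective (the function name and signature below are illustrative, not taken from [6]):

```python
import torch
import torch.nn.functional as F


def cross_entropy_with_z_loss(logits: torch.Tensor, targets: torch.Tensor,
                              z_loss_coeff: float = 1e-4) -> torch.Tensor:
    """Cross-entropy plus the auxiliary z-loss penalty log^2 Z (illustrative sketch)."""
    ce = F.cross_entropy(logits, targets)      # standard classification loss
    log_z = torch.logsumexp(logits, dim=-1)    # log Z for each example
    z_loss = (log_z ** 2).mean()               # log^2 Z, averaged over the batch
    return ce + z_loss_coeff * z_loss
```

With the small coefficient, the penalty is negligible while $\log Z$ stays near zero and only exerts meaningful pressure if the normalizer drifts away from one, leaving the main loss essentially unchanged during normal training.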