instability was observed for large models. It was caused by extremely large values in the attention logits $z_{ij} = \langle q_i, k_j \rangle / \sqrt{d_h}$, which lead to (almost one-hot) attention weights with near-zero entropy.
The mitigation is qk-layernorm, which applies LayerNorm to the queries and keys before the attention logits are computed; models with qk-layernorm exhibit considerably lower LR sensitivity and train to low loss at high learning rates.
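As a rough sketch of this mitigation (not the paper's implementation; the class and argument names below are illustrative), the PyTorch module applies LayerNorm to the queries and keys over the per-head dimension before the dot product, which bounds the attention logits $z_{ij}$:

```python
import math

import torch
import torch.nn as nn


class QKLayerNormAttention(nn.Module):
    """Multi-head self-attention with LayerNorm on queries and keys (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        # qk-layernorm: normalize queries and keys over the per-head dimension
        # before the dot product, bounding the attention logits z_ij.
        self.q_norm = nn.LayerNorm(self.d_head)
        self.k_norm = nn.LayerNorm(self.d_head)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, time, d_head).
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)
        # z_ij = <q_i, k_j> / sqrt(d_head); causal masking omitted for brevity.
        z = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        w = z.softmax(dim=-1)
        y = (w @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(y)
```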
z-loss
Another instability, observed when training large models, is divergence of the output logits from the log probabilities [6]. Let $y$ denote the model's output logits, which are used to compute class probabilities $p_i$ via a softmax $p_i = e^{y_i}/Z$, where $Z = \sum_j e^{y_j}$.
This instability occurs when the logits diverge and become very negative.
In contrast to the attention logit growth instability, this divergence occurs towards the end of training.
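One way to see why the logits are free to drift is that the softmax is invariant to adding a constant to every logit, so the cross-entropy loss by itself places no constraint on $\log Z$. The short sketch below, with made-up logit values, illustrates this:

```python
import torch

# The softmax is invariant to adding a constant to every logit, so the
# cross-entropy loss alone does not pin down log Z; the logits can drift
# far negative while the predicted probabilities stay exactly the same.
y = torch.tensor([2.0, -1.0, 0.5])        # example output logits (made up)
y_shifted = y - 50.0                      # same logits, uniformly shifted

print(torch.softmax(y, dim=-1))           # tensor([0.7856, 0.0391, 0.1753])
print(torch.softmax(y_shifted, dim=-1))   # identical probabilities
print(torch.logsumexp(y, dim=-1))         # log Z ≈ 2.24
print(torch.logsumexp(y_shifted, dim=-1)) # log Z ≈ -47.76, far from zero
```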
The mitigation proposed by Chowdhery et al. [6] is to encourage $\log Z$ to remain close to zero, i.e., $Z \approx 1$.
To do so, they add an auxiliary loss $\log^2 Z$, referred to as z-loss, with a coefficient of $10^{-4}$.
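A minimal sketch of this mitigation, assuming a standard cross-entropy objective (the function name and signature below are illustrative, not taken from [6]):

```python
import torch
import torch.nn.functional as F


def cross_entropy_with_z_loss(logits: torch.Tensor, targets: torch.Tensor,
                              z_loss_coeff: float = 1e-4) -> torch.Tensor:
    """Cross-entropy plus the auxiliary z-loss penalty log^2 Z (illustrative sketch)."""
    ce = F.cross_entropy(logits, targets)      # standard classification loss
    log_z = torch.logsumexp(logits, dim=-1)    # log Z for each example
    z_loss = (log_z ** 2).mean()               # log^2 Z, averaged over the batch
    return ce + z_loss_coeff * z_loss
```

With the small coefficient, the penalty is negligible while $\log Z$ stays near zero and only exerts meaningful pressure if the normalizer drifts away from one, leaving the main loss essentially unchanged during normal training.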