Summary
 u-μP improves upon μP by combining it with Unit Scaling:
 μP ensures that the scale of activations is independent of model size
 Unit Scaling ensures that activations, weights and gradients begin training with a scale of one
Detailed
 you need to scale down the residual add to prevent activation scales from blowing up with depth
Reminder of the μP setting
Abc-parametrization
 μP, SP, and the Neural Tangent Kernel (NTK) are all instances of abc-parametrizations. This assumes a model under training where weights are defined as:
 $w_{0}∼N(0,B_{W}^{2})$
 $W_{t}=A_{W}⋅w_{t}$,
 $w_{t+1}=w_{t}+C_{W}⋅Φ_{t}(∇L_{0},...,∇L_{t})$,
 with $t$ a timestep and $Φ_{t}(∇L_{0},...,∇L_{t})$ the weight update based on previous loss gradients.
 $A_{W}$ = parameter multiplier
 For example, in the attention logit calculation $⟨k,q⟩/d_{head}$ where $q=Wx$, the $1/d_{head}$ factor is a multiplier. It may also be thought of as the parameter multiplier of $W$ if we rewrite the attention logit as $⟨k,(W/d_{head})x⟩$.
 Note that parameter multipliers cannot be absorbed into the initialization in general, since they affect backpropagation. Nevertheless, after training is done, parameter multipliers can always be absorbed into the weight.
 $B_{W}$ = per-parameter initialization
 $C_{W}$ = per-parameter learning rate
 A parametrization scheme such as μP is then defined by specifying how the scalars $A_{W}$, $B_{W}$, $C_{W}$ change with model width.
 This can be expressed in terms of width-dependent factors $a_{W}$, $b_{W}$, $c_{W}$, such that $A_{W}∝a_{W}$, $B_{W}∝b_{W}$, $C_{W}∝c_{W}$.
 (The particular choice of these scaling factors is what defines μP.)
Abc-symmetry
 A key property of the abc-parametrization is that one can shift scales between $A_{W}$, $B_{W}$, $C_{W}$ in a way that preserves learning dynamics (i.e. the activations computed during training are unchanged). We term this abc-symmetry. For a fixed $θ>0$, the behavior of a network trained with Adam is invariant to changes of the kind: $A_{W}←A_{W}⋅θ,B_{W}←B_{W}/θ,C_{W}←C_{W}/θ$
 This means that parametrizations like μP can be presented in different but equivalent ways. Abc-symmetry is a key component in developing u-μP (and is why the u-μP scheme is consistent with both μP and the spectral parametrization).
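The invariance is easy to check numerically. The sketch below (my own toy example, not from the paper: a linear-regression loss and a from-scratch Adam with eps = 0, since a nonzero eps breaks the symmetry slightly) trains the same model under two abc-settings related by θ and compares the resulting effective weights:

```python
import numpy as np

def adam_step(m, v, g, t, beta1=0.9, beta2=0.999):
    # Adam direction Phi_t with eps = 0, so the direction is exactly
    # invariant to rescaling all gradients by a constant theta
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return m, v, m_hat / np.sqrt(v_hat)

def train(A, w0, C, x, y, steps=20):
    # W_t = A * w_t ; w is updated by C * Phi_t (descent step shown,
    # i.e. the sign is folded into the update)
    w, m, v = w0.copy(), 0.0, 0.0
    for t in range(1, steps + 1):
        W = A * w
        err = x @ W - y            # toy linear-regression residual
        grad_W = x.T @ err         # dL/dW
        grad_w = A * grad_W        # chain rule: dL/dw = A * dL/dW
        m, v, phi = adam_step(m, v, grad_w, t)
        w = w - C * phi
    return A * w                   # final effective weight W

rng = np.random.default_rng(0)
x, y = rng.normal(size=(32, 8)), rng.normal(size=(32, 4))
w0 = rng.normal(size=(8, 4))

theta = 8.0  # a power of two keeps the check exact in floating point
W_base = train(A=1.0, w0=w0, C=1e-2, x=x, y=y)
W_shift = train(A=theta, w0=w0 / theta, C=1e-2 / theta, x=x, y=y)
print(np.allclose(W_base, W_shift))  # True: dynamics are unchanged
```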
Transferable HPs
 The above terms are defined by the parametrization choice; however, there are also hyperparameters chosen by the user.
 All μTransferable HPs function as multipliers and can be split into three kinds, which contribute to the three (non-HP) multipliers given by the abc-parametrization: $α_{W}$, $σ_{W}$, $η_{W}$, where $A_{W}∝α_{W}$, $B_{W}∝σ_{W}$, $C_{W}∝η_{W}$.
 $α_{W}$ = operator scaling
 $σ_{W}$ = init scaling
 $η_{W}$ = per-parameter learning rate scaling
The challenges with μP in practice

Not all training setups give μTransfer
 Vanilla μP works in the overfitting regime but fails to transfer in the standard (underfitting) LM training regime
 The fix is (1) removal of trainable parameters from normalization layers and (2) use of the independent form of AdamW weight decay
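The difference between the two decay forms can be sketched as follows (decay step only, optimizer update omitted; the function names are illustrative, not a real API):

```python
import numpy as np

def adamw_decay_standard(w, lr, wd):
    # torch.optim.AdamW-style decay: the decay term is multiplied by
    # the learning rate, so sweeping lr silently changes regularization
    return w - lr * wd * w

def adamw_decay_independent(w, lr, wd):
    # "independent" decay: the per-step shrinkage is wd alone,
    # unchanged when the learning rate is swept
    return w - wd * w

w = np.ones(3)
for lr in (1e-2, 1e-3):
    print(lr,
          adamw_decay_standard(w, lr, wd=0.1)[0],    # varies with lr
          adamw_decay_independent(w, lr, wd=0.1)[0])  # constant
```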

It’s not clear which hyperparameters to sweep
 In theory, the search space of μTransferable HPs includes $α_{W}$, $σ_{W}$, $η_{W}$ for every parameter tensor $W$ in the model
 there's coupling between them if you're not careful
 The relative size of a weight update is determined by the ratio $η_{W}/σ_{W}$ (size of update / size of current weight)
 Consider the commonly-used global $σ_{init}$ HP. At initialization the activations going into the FFN swish function have $std(x_{swish})∝σ_{W_{gate}}$, whereas the self-attention softmax activations have $std(x_{attn})∝σ_{W_{Q}}σ_{W_{K}}$. A global $σ$ HP thus has a linear effect on the FFN and a quadratic effect on attention, suggesting that this grouping may not be ideal.
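A quick numerical check of this linear-vs-quadratic effect (a sketch with hypothetical projection names and arbitrarily chosen widths):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1024, 256
x = rng.normal(size=(n, d))

def stds_at(sigma):
    # hypothetical gate / query / key projections, init std = sigma
    W_gate = rng.normal(scale=sigma, size=(d, d))
    W_q = rng.normal(scale=sigma, size=(d, d))
    W_k = rng.normal(scale=sigma, size=(d, d))
    x_swish = x @ W_gate                  # input to the swish gate
    logits = (x @ W_q) @ (x @ W_k).T / d  # pre-softmax attention logits
    return x_swish.std(), logits.std()

s1, a1 = stds_at(0.02)
s2, a2 = stds_at(0.04)  # double sigma
print(s2 / s1, a2 / a1)  # ≈ 2 (linear in sigma), ≈ 4 (quadratic)
```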

Base shape complicates usage
 Original μP requires an extra “base” model shape to correctly initialize the model

μP struggles with low precision
 Low-precision training runs that successfully converge in SP can diverge with μP because of its generally smaller init and multiplier scales (underflow of gradients)
The Unit-Scaled Maximal Update Parametrization
The u-μP abc-parametrization

How they go from μP to u-μP

drop $σ_{W}$ (because Unit Scaling assumes unit variance)
 they can do so by using abc-symmetry to shift the $1/\sqrt{fan\text{-}in}$ scale in $B_{W}$ under μP into $A_{W}$ and $C_{W}$ (Equations 4 and 5 in the paper)
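The shift can be written out as a small worked example. The μP hidden-weight factors used below (multiplier $a=1$, init std $b=1/\sqrt{fan\text{-}in}$, Adam LR factor $c=1/fan\text{-}in$) are assumptions based on the usual μP table:

```python
import math

def abc_shift(a, b, c, theta):
    # abc-symmetry: (a, b, c) -> (a*theta, b/theta, c/theta)
    # leaves the training dynamics under Adam unchanged
    return a * theta, b / theta, c / theta

fan_in = 1024
# assumed muP hidden-weight factors under Adam (see lead-in)
a, b, c = 1.0, 1 / math.sqrt(fan_in), 1 / fan_in

# pick theta = b so the init scale becomes exactly 1 (Unit Scaling),
# moving the 1/sqrt(fan-in) factor into the multiplier and the LR
a_u, b_u, c_u = abc_shift(a, b, c, theta=b)
print(a_u, b_u, c_u)  # 0.03125 1.0 0.03125  (= 1/√fan-in, 1, 1/√fan-in)
```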

drop the $base\text{-}fan\text{-}in$ HP via the above trick, also shifting the burden of the $α_{W}$ HPs onto the unit-scaling ops

change the input learning rate to $1/fan\text{-}out$.
 a slight deviation from μP in terms of the math
 a key change for performance, without much theoretical justification


It’s important to note that the scaling rules in this table must be combined with the standard Unit Scaling rules for other non-matmul operations.
 e.g. gated SiLU, residual add, softmax cross-entropy, …
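As an illustration, a unit-preserving residual add might look like the following sketch (the τ multiplier and the exact normalization are my assumptions, and the real Unit Scaling rules also constrain gradient scales separately):

```python
import numpy as np

def unit_residual_add(skip, residual, tau=1.0):
    # if both branches have unit variance and are roughly uncorrelated,
    # dividing by sqrt(1 + tau^2) keeps the output at unit scale; tau
    # acts as a hypothetical residual-ratio multiplier
    return (skip + tau * residual) / np.sqrt(1 + tau**2)

rng = np.random.default_rng(0)
skip = rng.normal(size=100_000)
res = rng.normal(size=100_000)
out = unit_residual_add(skip, res, tau=0.5)
print(out.std())  # ≈ 1.0
```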
Why it works better than vanilla μP
 We can attribute the difficulties μP has with low precision to the fact that it ignores constant factors (along with weight and gradient scaling), only ensuring that activations are of order $Θ(1)$. The stricter condition of unit scale across all tensors at initialization provides a way of leveraging μP’s rules in order to make low-precision training work.
A principled approach to hyperparameters
 How to sweep HPs has been a mess in the μP literature
 We want
 Minimal cardinality: the use of as few HPs as possible.
 Minimal interdependency: the optimal value of each HP should not depend on the value of other HPs, simplifying the search space.
 Interpretability: there should be a clear explanation for what an HP’s value ‘means’ in the context of the model.
 First, we can drop all $σ_{W}$ as we assume unit scaling (abc-symmetry allows us to do so), leaving just $α_{W}$ and $η_{W}$
 Second, several $α_{W}$ combine multiplicatively with other $α_{W}$ HPs, e.g. $std(x_{attn})∝α_{W_{Q}}α_{W_{K}}$. It is easier to define things at the operator level instead of the weight level; in this instance, it is more natural to use a single $α$ parameter and associate it with $α_{attn\text{-}softmax}$
 Use a single global $η$ and group $α$ HPs across layers. (This is the best tradeoff between expressivity and cardinality)

The considered hyperparameters for a Transformer

How to choose such operator multipliers for an architecture is described in Appendix F and Appendix G
 The idea is that you want a multiplier at every operation in your computational graph that is a non-homogeneous function or a non-unary function (not single-input)
 i.e. a k-homogeneous function is an $f$ s.t. $f(αh)=α^{k}f(h)$
 RMSNorm is 0-homogeneous, Linear is 1-homogeneous, and the QK matmul is 2-homogeneous
 A residual add is non-unary
 Sigmoid and the cross-entropy loss are non-homogeneous
 After having settled on needed multipliers, simplify them following the minimal cardinality, expressivity, and interpretability constraints
 assuming unit scaling usually allows for more interpretable multipliers
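These homogeneity properties are easy to verify numerically (a toy sketch; shapes are arbitrary and the QK matmul shares one weight for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 16))
W = rng.normal(size=(16, 16))
alpha = 3.0

def rms_norm(x):
    return x / np.sqrt((x**2).mean(axis=-1, keepdims=True))

linear = lambda x: x @ W
qk = lambda x: (x @ W) @ (x @ W).T  # toy QK^T matmul

# f is k-homogeneous iff f(alpha * h) == alpha**k * f(h)
for f, k in [(rms_norm, 0), (linear, 1), (qk, 2)]:
    assert np.allclose(f(alpha * h), alpha**k * f(h))

# sigmoid satisfies this for no k: it is non-homogeneous, so its
# behaviour depends on input scale and it warrants its own multiplier
sigmoid = lambda x: 1 / (1 + np.exp(-x))
print(np.allclose(sigmoid(alpha * h), alpha * sigmoid(h)))  # False
```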
How to do the HPs sweep

The standard approach to HP search for μTransfer is a random sweep over all HPs simultaneously. This is costly.

Due to the minimal-interdependency criterion applied previously, u-μP supposedly allows for a simpler scheme, called independent search.

The idea is to first sweep the LR, followed by a set of one-dimensional sweeps of the other HPs (which can be run in parallel). The best results from the individual sweeps are combined to form the final set of HP values.

An even simpler scheme, which only sweeps the LR and leaves the other HP values at 1, seems to work well in practice.
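Independent search can be sketched as below, assuming a hypothetical `train_and_eval(hps) -> loss` routine supplied by the caller and a toy objective for illustration:

```python
def independent_search(train_and_eval, lr_grid, alpha_grids):
    # Stage 1: 1-D sweep over the learning rate, all alpha HPs at 1
    base = {name: 1.0 for name in alpha_grids}
    best_lr = min(lr_grid, key=lambda lr: train_and_eval({**base, "lr": lr}))

    # Stage 2: 1-D sweeps of each alpha at the best LR, others held at
    # their default of 1 (these sweeps could run in parallel); the
    # winners are then combined into the final HP set
    best = {"lr": best_lr, **base}
    for name, grid in alpha_grids.items():
        best[name] = min(
            grid,
            key=lambda v, n=name: train_and_eval({**base, "lr": best_lr, n: v}),
        )
    return best

# toy objective with independent optima, for illustration only
optima = {"lr": 3e-3, "alpha_attn": 0.5, "alpha_res": 2.0}
loss = lambda hps: sum((hps[k] - optima[k]) ** 2 for k in optima)
found = independent_search(loss, [1e-3, 3e-3, 1e-2],
                           {"alpha_attn": [0.25, 0.5, 1.0],
                            "alpha_res": [1.0, 2.0, 4.0]})
print(found == optima)  # True
```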
Numerical properties

μP has gradients and weights with low RMS, at risk of FP8 underflow, whereas u-μP starts with RMS ≈ 1.

Many input activations do not grow in RMS during training (due to a preceding non-trainable RMSNorm); however, the attention out-projection and FFN down-projection have unconstrained input activations that grow considerably during training.

The decoder weight grows during training. Since it is preceded by a RMSNorm, the model may require scale growth in order to increase the scale of softmax inputs. Other weights grow slightly during training.

Gradients grow quickly but stabilize, except for the attention out-projection and FFN down-projection, whose gradients shrink as their inputs grow.

The main parameter affecting scale growth is the learning rate.
 Endtraining RMS is remarkably stable as width, depth, training steps and batch size are independently increased.
Prerequisites before applying u-μP
 Remove trainable parameters from normalization layers
 Use the independent form of AdamW weight decay
 Ensure training is in the underfitting regime (i.e. avoid excessive data repetition)
A guide to using u-μP
 Be careful of tensors with scale growth (inputs to the FFN and self-attention final projections)
 use the E5M2 format to represent the larger scales, or apply dynamic rescaling of the matmul input
 apply unit scaling with the correct scale constraints; for new operations, don’t hesitate to fit an empirical model for the scale of the op
Hyperparameter transfer results
 The setting matters a lot (i.e. number of tokens, model size, sequence length). Should be as representative as possible of the final training.
 At large scale, the learning rate $η$ and residual attention ratio $α_{res\text{-}attn\text{-}ratio}$ were the most important HPs. All other HPs can be left at their default value of 1.
 Non-LR HPs also have approximately constant optima across width under u-μP
How to use a good proxy model
 When using a relatively small proxy model with 8 layers and a width of 512 (4 attention heads), the HP-loss landscape is rather noisy. By doubling the width, they are able to discern the optimal values of the HPs more clearly.
 In general, width is the most reliable feature to transfer. Training steps and batch size also give good transfer, so moderate changes here are permissible. Depth is the least reliable feature for transfer, so they only recommend modest changes in depth.
 Keep the number of warmup steps constant, but always decay to the same final LR when varying the number of steps.
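That recommendation can be sketched as a schedule function. The cosine shape, 500 warmup steps and 10% final-LR fraction below are assumed defaults for illustration, not values from the paper:

```python
import math

def lr_schedule(step, total_steps, peak_lr, warmup_steps=500,
                final_lr_frac=0.1):
    # Fixed *number* of warmup steps (not a fixed fraction of training),
    # then a decay (cosine assumed here) that always ends at the same
    # final LR no matter how many total steps the run has
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    final_lr = peak_lr * final_lr_frac
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))

# a short and a long run decay to the same final LR
short_end = lr_schedule(9_999, 10_000, peak_lr=1e-2)
long_end = lr_schedule(99_999, 100_000, peak_lr=1e-2)
print(math.isclose(short_end, long_end, rel_tol=1e-3))  # True
```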