μTransfer (Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer)
In general, heuristics such as LayerNorm, BatchNorm, and Kaiming initialization try to keep the activation scales consistent at initialization.
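As a minimal numerical sketch of this idea (assuming a single linear layer with fan-in, i.e. Kaiming-style, initialization; the widths and random inputs are illustrative), the pre-activation scale stays roughly constant as the width grows:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [128, 512, 2048, 8192]:
    x = rng.standard_normal(n)                     # activation vector with O(1) coordinates
    W = rng.standard_normal((n, n)) / np.sqrt(n)   # fan-in init: entry variance 1/n
    z = W @ x                                      # pre-activation vector
    print(f"width={n:5d}  std(z)={z.std():.3f}")   # roughly 1.0 at every width
```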
Feature Learning Limit
To unlock feature learning, we need to see gradient updates for what they really are: a different kind of matrix from the randomly initialized weights they are added to.
When a matrix $W \in \mathbb{R}^{n \times n}$ multiplies an activation vector $x \in \mathbb{R}^n$ to produce a pre-activation vector, we calculate each coordinate by taking a row of $W$, multiplying it by $x$ coordinate-wise, and summing the coordinates of the resulting vector. When $W$'s entries are initialized with zero mean, this is a sum of roughly independent zero-mean terms. As such, the sum is roughly $\sqrt{n}$ times smaller than it would be if the entries had nonzero mean or were strongly correlated, due to the famous square-root cancellation effect underlying phenomena like the Central Limit Theorem.
Small proof
More precisely, $\mathrm{var}(z_k) = \mathrm{var}\!\left(\sum_i W_{ki} x_i\right) = \sum_i x_i^2 \,\mathrm{var}(W_{ki}) + \sum_{i \neq j} x_i x_j \,\mathrm{cov}(W_{ki}, W_{kj})$
The variance is composed of $n$ terms in the first sum and $\binom{n}{2}$ terms in the second ($O(n^2)$).
At initialization, $\mathrm{cov}(W_{ki}, W_{kj}) = 0$, but this is no longer the case after training. The standard deviation of $z_k$ therefore grows from $O(\sqrt{n})$ at initialization to $O(n)$ after training, a ratio of $\frac{\sqrt{n}}{\sqrt{n^2}} = \frac{1}{\sqrt{n}}$.
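This ratio is easy to check numerically. The sketch below uses an illustrative setup: the activation vector is all ones so its coordinates are $O(1)$, and "correlated" means the entries of a row are perfectly correlated. The independent case gives a standard deviation of about $\sqrt{n}$, the correlated case about $n$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 4096, 2000
x = np.ones(n)                       # fixed activation vector with O(1) coordinates

# Independent zero-mean entries: only the n variance terms survive.
z_indep = np.array([rng.standard_normal(n) @ x for _ in range(trials)])

# Perfectly correlated entries within a row: the O(n^2) covariance terms contribute too.
z_corr = np.array([(rng.standard_normal() * np.ones(n)) @ x for _ in range(trials)])

print("std, independent entries:", z_indep.std())                 # ~ sqrt(n) = 64
print("std, correlated entries :", z_corr.std())                  # ~ n = 4096
print("ratio                   :", z_indep.std() / z_corr.std())  # ~ 1/sqrt(n)
```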
The new pre-activation vector equals $(W + \Delta W)x = Wx + \Delta W x$, where $\Delta W$ is the gradient update.
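To see concretely why $\Delta W$ is a different kind of matrix, here is a sketch of one SGD step on a single linear layer with a squared loss (the learning rate, shapes, and gradient are illustrative assumptions). Because $\Delta W$ is an outer product involving $x$, the product $\Delta W x$ picks up a factor of $x \cdot x \approx n$ with no square-root cancellation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096
x = rng.standard_normal(n)
W = rng.standard_normal((n, n)) / np.sqrt(n)   # randomly initialized weights (fan-in init)

# One SGD step on z = W x with some loss L: if g = dL/dz, then dL/dW = g x^T,
# so the update dW = -lr * g x^T is an outer product correlated with x.
g = rng.standard_normal(n)
lr = 1e-3
dW = -lr * np.outer(g, x)

z_init   = W @ x     # iid entries: square-root cancellation
z_update = dW @ x    # entries correlated with x: (x @ x) ~ n, no cancellation

print("|W| entry :", np.abs(W).mean(),  "  |Wx| coordinate :", np.abs(z_init).mean())
print("|dW| entry:", np.abs(dW).mean(), "  |dWx| coordinate:", np.abs(z_update).mean())
# Wx coordinates are ~sqrt(n) times W's entry scale,
# while dWx coordinates are ~n times dW's entry scale.
```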
Varying width
However, once training starts, this consistency breaks down across different model widths.
In the default parameterization in PyTorch (the graph on the left), the activation scales diverge with width after one step of training. In µP (the graph on the right), the activation scales change by a consistent amount regardless of width at every training step. The y-axis shows the change in the network's activation scales on a fixed input after t = 0, 1, 2, 3, and 4 steps of training, as the width of the model (shown along the x-axis) varies. This consistency across widths makes models of different sizes comparably easy to optimize.
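The kind of coordinate check behind this figure can be reproduced in a few lines. The sketch below is an illustrative toy setup (a two-layer MLP, Adam, regression on random data, hypothetical widths and learning rate) and only covers the default-parameterization panel; under µP the printed drift would stay roughly flat across widths:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 64)   # fixed input batch
y = torch.randn(32, 1)    # fixed regression targets

for width in [256, 1024, 4096]:
    # Default PyTorch parameterization (the left panel's setting).
    model = nn.Sequential(nn.Linear(64, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    with torch.no_grad():
        out0 = model(x)                      # outputs on the fixed input at t=0

    for t in range(1, 5):
        loss = ((model(x) - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            drift = (model(x) - out0).abs().mean().item()
        print(f"width={width:5d}  step={t}  mean |change in output| = {drift:.3f}")
    # The drift grows with width here; in µP it would be roughly width-independent.
```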
Since µP networks of different widths share similar training dynamics, they likely also share similar optimal hyperparameters. Consequently, we can simply apply the optimal hyperparameters of a small model directly to a scaled-up version. We call this practical procedure µTransfer. If our hypothesis is correct, the training loss-hyperparameter curves of µP models of different widths should share a similar minimum.
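In practice this workflow is automated by the companion mup package (github.com/microsoft/mup). The sketch below follows that package's documented usage from memory, so treat the exact interface (MuReadout, set_base_shapes, MuAdam), the widths, and the learning rate as assumptions rather than a verified recipe:

```python
import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuAdam

def make_mlp(width, d_in=64, d_out=10):
    # Under µP the output layer is a MuReadout instead of a plain nn.Linear.
    return nn.Sequential(nn.Linear(d_in, width), nn.ReLU(), MuReadout(width, d_out))

# Base and delta models only tell mup which dimensions scale with "width";
# they are never trained.
base, delta = make_mlp(width=64), make_mlp(width=128)

# 1) Tune hyperparameters on a small proxy model parameterized in µP.
proxy = make_mlp(width=256)
set_base_shapes(proxy, base, delta=delta)
best_lr = 3e-3            # e.g. found by a grid search on the proxy (placeholder value)

# 2) µTransfer: reuse the same hyperparameters on the scaled-up model.
big = make_mlp(width=8192)
set_base_shapes(big, base, delta=delta)
optimizer = MuAdam(big.parameters(), lr=best_lr)   # same lr, much wider model
```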
Beyond width
For transformers, the optimal learning rate not only transfers across width; empirically, it also transfers across other scale dimensions such as depth, batch size, and sequence length.