µTransfer (Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer)

  • In general, heuristics try to keep the activation scales consistent at initialization (LayerNorm, BatchNorm, Kaiming init).
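
    As an illustrative check (not from the paper), the snippet below uses PyTorch's Kaiming initialization and measures the pre-activation scale of a linear layer at several widths; with the ReLU gain, the standard deviation stays near sqrt(2) regardless of width. The widths and batch size are arbitrary demo values.

    ```python
    import torch
    import torch.nn as nn

    # At initialization, Kaiming-initialized layers keep pre-activation scales
    # roughly constant as the width grows.
    torch.manual_seed(0)
    for width in [256, 1024, 4096]:                                # arbitrary demo widths
        layer = nn.Linear(width, width, bias=False)
        nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")  # std = sqrt(2 / fan_in)
        x = torch.randn(64, width)                                  # batch of O(1)-scale activations
        h = layer(x)                                                # pre-activations Wx
        print(f"width={width:5d}  std(pre-activation)={h.std():.3f}")  # ~sqrt(2), independent of width
    ```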

Feature Learning Limit

  • To unlock feature learning, we need to see gradient updates for what they really are: a different kind of matrix from their randomly initialized counterparts.
  • When a matrix $W$ multiplies an activation vector $x$ to produce a pre-activation vector $h = Wx$, we calculate a coordinate $h_i$ by taking a row $W_{i,:}$ from the matrix $W$, multiplying it by $x$ coordinate-wise, and summing the $n$ coordinates of the resulting vector. When $W$'s entries are initialized with zero mean, this summation is across roughly $n$ independent elements with zero mean. As such, this sum is smaller than it would be if the elements had nonzero mean or were strongly correlated, due to the famous square-root cancellation effect underlying phenomena like the Central Limit Theorem.
    • Small proof
      • More precisely, $h_i = \sum_{j=1}^{n} W_{ij}\, x_j$.
      • The sum is composed of $n$ terms, each with a weight entry $W_{ij}$ on the left and an activation coordinate $x_j$ on the right ($n$ being the width).
      • At init time, $\mathbb{E}[W_{ij} x_j] = 0$ and the terms are roughly independent, so $h_i$ scales like $\sqrt{n}$; but this is not the case after training, because the updated entries are correlated with the activations and the sum scales like $n$. The ratio of the standard deviations between init and after training therefore grows like $\sqrt{n}$.
    • The new pre-activation vector is equal to $(W + \Delta W)\,x = Wx + \Delta W\, x$, where $\Delta W$ are the gradient updates; the $Wx$ part keeps its square-root cancellation, while $\Delta W\, x$ does not (see the sketch after this list).
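
    The following snippet is a minimal numerical sketch of this argument (not code from the paper): it compares the coordinate scale of $Wx$ at initialization with that of $\Delta W\, x$ for a gradient-like rank-one update $\Delta W = g x^\top$, across a few widths. The widths and unit-variance entries are arbitrary choices for the demo.

    ```python
    import torch

    torch.manual_seed(0)
    for n in [256, 1024, 4096]:              # widths to compare (arbitrary demo values)
        x = torch.randn(n)                   # activation vector with O(1) coordinates
        W = torch.randn(n, n)                # init: i.i.d. zero-mean entries
        h_init = W @ x                       # each coordinate sums n independent zero-mean terms
        g = torch.randn(n)                   # hypothetical backprop error vector
        dW = torch.outer(g, x)               # rank-one, gradient-like update correlated with x
        h_update = dW @ x                    # each coordinate sums n terms that no longer cancel
        print(f"n={n:5d}  std(Wx) = {h_init.std():8.1f} (~sqrt(n))   "
              f"std(dW x) = {h_update.std():10.1f} (~n)")
    ```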

Varying width

  • However, once training starts, this consistency breaks down across different model widths.
  • In the default parameterization in PyTorch (the graph on the left), the activation scales diverge with width after one step of training. In µP (the graph on the right), the activation scales change by a consistent amount regardless of width at every training step. The y-axis shows the change of the network's activation scales on a fixed input after t = 0, 1, 2, 3, and 4 steps of training, while the x-axis shows the width of the model. This enables consistent behaviour, and hence comparable ease of optimization, across different model sizes.
  • Since µP networks of different widths share similar training dynamics, they likely also share similar optimal hyperparameters. Consequently, we can simply apply the optimal hyperparameters of a small model directly to a scaled-up version. We call this practical procedure µTransfer. If our hypothesis is correct, the training loss vs. hyperparameter curves for µP models of different widths should share a similar minimum. A sketch of the recipe follows this list.
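
    As a rough illustration of this recipe (not from the paper), the sketch below follows the README of the `mup` package (github.com/microsoft/mup) from memory: wrap the output layer in `MuReadout`, call `set_base_shapes` against a narrow base model, tune the learning rate on a narrow proxy, and reuse it for the wide model with `MuAdam`. The toy architecture, model widths, and `tuned_lr` are made-up values, and the exact API should be checked against the package.

    ```python
    import torch.nn as nn
    import torch.nn.functional as F
    from mup import MuReadout, set_base_shapes, MuAdam  # assumed API of microsoft/mup

    class MLP(nn.Module):
        def __init__(self, width, d_in=32, d_out=10):
            super().__init__()
            self.fc1 = nn.Linear(d_in, width)
            self.fc2 = nn.Linear(width, width)
            self.readout = MuReadout(width, d_out)       # µP-aware output layer
        def forward(self, x):
            return self.readout(F.relu(self.fc2(F.relu(self.fc1(x)))))

    # Base and delta models tell mup which dimensions scale with width.
    base, delta = MLP(width=64), MLP(width=128)

    # Small proxy model: sweep hyperparameters here (cheap).
    proxy = MLP(width=256)
    set_base_shapes(proxy, base, delta=delta)
    # ... run a learning-rate sweep on `proxy` and pick the best value ...
    tuned_lr = 3e-3                                       # hypothetical result of the sweep

    # Large target model: reuse the proxy's optimum directly (µTransfer).
    target = MLP(width=4096)
    set_base_shapes(target, base, delta=delta)
    optimizer = MuAdam(target.parameters(), lr=tuned_lr)  # µP-aware optimizer scales lr per layer
    ```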

Beyond width

  • For transformers, the optimal learning rate not only transfers across width; it also empirically transfers across other scale dimensions, such as depth, batch size, and sequence length.