Summary

Feature learning is achieved by scaling the spectral norm of weight matrices and their updates like $\sqrt{\mathrm{fan\_out}/\mathrm{fan\_in}}$, i.e. weight matrices should take in vectors of length `fan_in` and spit out vectors of length `fan_out`. This scaling happens at init time and also at gradient update time.

An important fact about a matrix `self.weight` with `fan_in` much larger than `fan_out` is that its null space is huge, meaning that most of the input space is mapped to zero: the dimension of the null space is at least `fan_in - fan_out`. At initialization, most of a fixed input `x` will lie in this null space. This means that to get the output of `self.forward` to have unit variance at initialization, you need to pick a huge initialization scale `sigma` in order to scale up the component of `x` that does not lie in the null space. But after a few steps of training, the situation changes: gradient descent will cause the input `x` to align with the non-null space of `self.weight`. This means that the `sigma` you chose to control the activations at initialization is now far too large in hindsight, and the activations will blow up! This problem only gets worse with increasing `fan_in`. The solution is simple: don't choose `sigma` to control variance at initialization! Instead, choose `sigma` under the assumption that inputs fall in the non-null space. Even if this makes the activations too small at initialization, this is fine, as they will quickly "warm up" after a few steps of training.
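The blow-up mechanism is easy to see numerically. Below is a minimal numpy sketch (the widths and the `1/sqrt(fan_in)` normalization are illustrative choices, not from the text) comparing the output norm for a random input, which mostly falls in the null space, against an input aligned with the top singular direction:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 4096, 64  # fan_in >> fan_out: null space has dim >= fan_in - fan_out

W = rng.standard_normal((fan_out, fan_in)) / np.sqrt(fan_in)

# A random unit-norm input lies mostly in the null space of W.
x = rng.standard_normal(fan_in)
x /= np.linalg.norm(x)

# An aligned input: the top right singular vector, entirely in the non-null space.
U, S, Vh = np.linalg.svd(W, full_matrices=False)
x_aligned = Vh[0]

gain_random = np.linalg.norm(W @ x)
gain_aligned = np.linalg.norm(W @ x_aligned)  # equals the spectral norm S[0]
print(gain_random, gain_aligned)
```

Any `sigma` tuned so that `gain_random` comes out to 1 would overshoot by roughly `sqrt(fan_in / fan_out)` once training aligns the input with the non-null space.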
Detailed Math

Alignment
It's worth expanding a little on what we mean by alignment here. When we say that an input `x` aligns with a weight matrix `weight`, we mean that if we compute `U, S, V = torch.linalg.svd(weight)`, then the input `x` will tend to have a larger dot product with the rows of `V` that correspond to larger diagonal entries of the singular value matrix `S`. When we say that layers align, we mean that the outputs of one layer will align with the next layer. This happens naturally during gradient descent. https://jeremybernste.in/modula/goldenrules/#threegoldenrules
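One concrete (illustrative, not from the text) way to inspect this is to look at the dot products of a normalized input with the rows of `V`, which come ordered by singular value. An aligned input concentrates its mass on the leading rows:

```python
import numpy as np

rng = np.random.default_rng(1)
weight = rng.standard_normal((32, 256))
# rows of Vh are the right singular vectors, sorted by descending S
U, S, Vh = np.linalg.svd(weight, full_matrices=False)

def singular_coefficients(x, Vh):
    # dot products of the normalized input with the right singular vectors
    return Vh @ (x / np.linalg.norm(x))

x_random = rng.standard_normal(256)                  # no preferred direction
x_aligned = Vh[0] + 0.05 * rng.standard_normal(256)  # mostly along the top direction

c_random = singular_coefficients(x_random, Vh)
c_aligned = singular_coefficients(x_aligned, Vh)
print(abs(c_random[0]), abs(c_aligned[0]))  # aligned input dominates on the top row
```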
Spectral norm
The spectral norm is the largest factor by which a matrix can increase the norm of a vector on which it acts. In deep learning, the spectral norm of a weight matrix upper-bounds the scale of the activations it produces.
Feature learning
The feature learning regime can be summarized as: both the features and their updates upon a step of gradient descent must be the proper size.

Let $h_{\ell}(x) \in \mathbb{R}^{n_{\ell}}$ denote the features of input $x$ at layer $\ell$ of a neural network, and let $\Delta h_{\ell}(x) \in \mathbb{R}^{n_{\ell}}$ denote their change after a gradient step. We desire that:

$\|h_{\ell}\|_{2} = \Theta(\sqrt{n_{\ell}}) \text{ and } \|\Delta h_{\ell}\|_{2} = \Theta(\sqrt{n_{\ell}}), \text{ at layers } \ell = 1, \dots, L-1$

This amounts to asking that the "typical element size" of the vectors $h_{\ell}(x)$ and $\Delta h_{\ell}(x)$ is $\Theta(1)$ with respect to the width $n_{\ell}$. This is motivated by the fact that activation functions are designed to take order-one inputs and give order-one outputs (e.g. tanh). The requirement also stipulates that feature entries undergo $\Theta(1)$ updates during training: any larger updates would blow up at large width, and any smaller updates would vanish at large width.
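The link between entry size and norm is just the definition of the 2-norm; a quick numerical check (widths chosen arbitrarily) shows that a vector with $\Theta(1)$ entries has $\|h\|_{2}/\sqrt{n}$ hovering near a constant as the width grows:

```python
import numpy as np

# Entries of size Theta(1) imply a 2-norm of size Theta(sqrt(n)):
ratios = []
for n in (256, 1024, 4096):
    h = np.random.default_rng(0).standard_normal(n)  # order-one entries
    ratios.append(np.linalg.norm(h) / np.sqrt(n))
print(ratios)  # each ratio stays near 1 regardless of n
```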
Condition 1 (Spectral scaling)

The main message is that feature learning in the sense of the above definition may be ensured by the following spectral scaling condition on the weight matrices of a deep network and their gradient updates.

Consider applying a gradient update $\Delta W_{\ell} \in \mathbb{R}^{n_{\ell} \times n_{\ell-1}}$ to the $\ell$th weight matrix $W_{\ell} \in \mathbb{R}^{n_{\ell} \times n_{\ell-1}}$. The spectral norms of these matrices should satisfy: $\|W_{\ell}\|_{*} = \Theta\!\left(\sqrt{\frac{n_{\ell}}{n_{\ell-1}}}\right) \text{ and } \|\Delta W_{\ell}\|_{*} = \Theta\!\left(\sqrt{\frac{n_{\ell}}{n_{\ell-1}}}\right) \text{ at layers } \ell = 1, \dots, L$

We have implicitly assumed that the input has size $\|x\|_{2} = \Theta(\sqrt{n_{0}})$, which is standard for image data. Language models are an important counterexample, where embedding matrices take one-hot inputs (i.e. not all of the width acts on the first layer), and the $\sqrt{n_{0}}$ in Condition 1 should be replaced by 1.
Parametrization 1 (Spectral parametrization): efficient implementation of the spectral scaling condition

Spectral scaling induces feature learning

How to implement it?

They claim that the spectral scaling condition (Condition 1) is satisfied and feature learning is achieved (as per Desideratum 1) if the initialization scale and learning rate of each layer $β$ are chosen according to:
$\sigma_{\ell} = \Theta\!\left(\frac{1}{\sqrt{n_{\ell-1}}} \min\!\left(1, \sqrt{\frac{n_{\ell}}{n_{\ell-1}}}\right)\right) \text{ and } \eta_{\ell} = \Theta\!\left(\frac{n_{\ell}}{n_{\ell-1}}\right)$
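The prescription above is straightforward to code up. A minimal sketch (the function name and the `base_lr` knob are my own; the $\Theta(\cdot)$ only fixes the width-scaling, so the constant factors here are a choice):

```python
import math

def spectral_parametrization(fan_in: int, fan_out: int, base_lr: float = 1.0):
    """Per-layer init scale and learning rate under Parametrization 1.

    sigma = min(1, sqrt(fan_out/fan_in)) / sqrt(fan_in)
    eta   = base_lr * fan_out / fan_in
    """
    sigma = min(1.0, math.sqrt(fan_out / fan_in)) / math.sqrt(fan_in)
    eta = base_lr * fan_out / fan_in
    return sigma, eta

# A square layer recovers the familiar 1/sqrt(fan_in) init scale and an O(1) LR:
print(spectral_parametrization(1024, 1024))  # -> (0.03125, 1.0)
```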
Random initialization
In common practice, $W_{\ell}$ is initialized as $W_{\ell} = \sigma_{\ell} \cdot \overline{W}_{\ell}$, where all elements of $\overline{W}_{\ell}$ are drawn i.i.d. from a normal distribution with mean zero and unit variance. The spectral norm of a matrix constructed this way is roughly: $\|W_{\ell}\|_{*} \approx \sigma_{\ell} \cdot (\sqrt{n_{\ell}} + \sqrt{n_{\ell-1}})$
To get the desired scaling $\|W_{\ell}\|_{*} = \Theta(\sqrt{n_{\ell}/n_{\ell-1}})$, we need merely choose: $\sigma_{\ell} = \Theta\!\left(\sqrt{n_{\ell}/n_{\ell-1}} \cdot (\sqrt{n_{\ell}} + \sqrt{n_{\ell-1}})^{-1}\right)$
Simplifying within the $\Theta(\cdot)$, we arrive at $\sigma_{\ell}$ scaled as in the spectral parametrization (Parametrization 1). Initializing weights with a prefactor $\sigma_{\ell}$ that scales in this manner achieves the correct spectral norm of $W_{\ell}$.
We note that the constant factor suppressed by the $\Theta(\cdot)$ here will usually be small; for example, a prefactor of $\sqrt{2}$ agrees with typical practice for ReLU networks at most layers.
If $\overline{W}_{\ell}$ is instead a random semi-orthogonal matrix (spectral norm of 1), then we can simply use the prefactor: $\sigma_{\ell} = \Theta(\sqrt{n_{\ell}/n_{\ell-1}})$
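A sketch of this variant (the QR-based construction is my own choice of how to sample a semi-orthogonal matrix; any method giving orthonormal rows or columns works):

```python
import numpy as np

def semi_orthogonal_init(fan_out: int, fan_in: int, rng):
    """Random semi-orthogonal matrix rescaled to spectral norm sqrt(fan_out/fan_in)."""
    a = rng.standard_normal((max(fan_out, fan_in), min(fan_out, fan_in)))
    q, _ = np.linalg.qr(a)  # orthonormal columns -> spectral norm exactly 1
    if fan_out < fan_in:
        q = q.T             # orthonormal rows for a wide matrix
    return np.sqrt(fan_out / fan_in) * q

w = semi_orthogonal_init(64, 1024, np.random.default_rng(0))
print(np.linalg.norm(w, 2))  # spectral norm = sqrt(64/1024) = 0.25
```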
Biases
 Extend the spectral analysis to biases.
Let $b_{\ell} \in \mathbb{R}^{n_{\ell}}$ be a bias vector which enters during forward propagation as $h_{\ell}(x) = W_{\ell} h_{\ell-1}(x) + b_{\ell}$. We may choose to view the bias vector as a weight matrix $b_{\ell} \in \mathbb{R}^{n_{\ell} \times 1}$ connecting an auxiliary layer of width 1, whose activation is fixed to 1, to the $\ell$th hidden layer, after which we may simply apply our scaling analysis for weight matrices.
Since the spectral norm of an $n_{\ell} \times 1$ matrix equals its 2-norm, the spectral scaling condition (Condition 1) prescribes that $\|b_{\ell}\|_{2} = \Theta(\sqrt{n_{\ell}})$ and $\|\Delta b_{\ell}\|_{2} = \Theta(\sqrt{n_{\ell}})$, and Parametrization 1 prescribes the initialization scale and learning rate $\sigma_{\ell} = \Theta(1)$ and $\eta_{\ell} = \Theta(n_{\ell})$. In practice, one may usually just take $\sigma_{\ell} = 0$.
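The bias scales drop out of Parametrization 1 by plugging in a fan-in of 1. A quick check (helper name is mine, constants are illustrative):

```python
import math

def spectral_sigma_eta(fan_in: int, fan_out: int):
    # Parametrization 1: init scale and per-layer learning-rate scaling
    sigma = min(1.0, math.sqrt(fan_out / fan_in)) / math.sqrt(fan_in)
    eta = fan_out / fan_in
    return sigma, eta

# Treating the bias as an (n_l x 1) weight matrix means plugging in fan_in = 1:
n_l = 4096
print(spectral_sigma_eta(1, n_l))  # -> (1.0, 4096.0): sigma = Theta(1), eta = Theta(n_l)
```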
Comparison to standard parametrization (SP)
"Kaiming," "Xavier," or "LeCun" initialization
$\sigma_{\ell} = \Theta\!\left(\frac{1}{\sqrt{n_{\ell-1}}}\right) \text{ and } \eta_{\ell} = \Theta(1)$
Notice that the SP initialization scale exceeds the spectral parametrization's in any layer with fan-out smaller than fan-in (e.g. the second MLP matrix in a GLU block in most cases).
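To make the gap concrete, compare the two scales on an illustrative down-projection where fan-out is a quarter of fan-in (the width `d` is arbitrary):

```python
import math

d = 1024
fan_in, fan_out = 4 * d, d  # illustrative down-projection: fan_out < fan_in

sigma_sp = 1 / math.sqrt(fan_in)  # standard (LeCun-style) fan-in-only scaling
sigma_spectral = min(1.0, math.sqrt(fan_out / fan_in)) / math.sqrt(fan_in)

print(sigma_sp / sigma_spectral)  # SP over-initializes by sqrt(fan_in/fan_out) = 2
```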