Summary
- Feature learning is achieved by scaling the spectral norm of weight matrices and their updates like $\sqrt{\texttt{fan\_out}/\texttt{fan\_in}}$.
    - i.e. weight matrices should take in vectors of length `fan_in` and spit out vectors of length `fan_out`.
    - This scaling happens at init time and also at gradient update time (see the sketch at the end of this summary).
- An important fact about a matrix `self.weight` with `fan_in` much larger than `fan_out` is that the null space is huge, meaning that most of the input space is mapped to zero. The dimension of the null space is at least `fan_in - fan_out`. At initialization, most of a fixed input `x` will lie in this null space.
    - This means that to get the output of `self.forward` to have unit variance at initialization, you need to pick a huge initialization scale `sigma` in order to scale up the component of `x` that does not lie in the null space.
    - But after a few steps of training, the situation changes. Gradient descent will cause the input `x` to align with the non-null space of `self.weight`. This means that the `sigma` you chose to control the activations at initialization is now far too large in hindsight, and the activations will blow up! This problem only gets worse with increasing `fan_in`.
    - The solution to this problem is simple: don't choose `sigma` to control variance at initialization! Instead, choose `sigma` under the assumption that inputs fall in the non-null space. Even if this makes the activations too small at initialization, this is fine as they will quickly "warm up" after a few steps of training.
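A minimal PyTorch sketch of the summary above (my own illustration, not the paper's code): the weight is set to spectral norm $\sqrt{\texttt{fan\_out}/\texttt{fan\_in}}$ at init, and each update applied to it is rescaled to the same spectral scale. The explicit rescaling via `torch.linalg.matrix_norm`, the shapes, and the learning rate are illustrative choices.

```python
import torch

def scale_to_spectral_norm(M: torch.Tensor, target: float) -> torch.Tensor:
    """Rescale M so that its largest singular value equals `target`."""
    return M * (target / torch.linalg.matrix_norm(M, ord=2))

fan_out, fan_in = 256, 1024
target = (fan_out / fan_in) ** 0.5              # sqrt(fan_out / fan_in)

# Init time: weight has spectral norm sqrt(fan_out / fan_in).
weight = scale_to_spectral_norm(torch.randn(fan_out, fan_in), target)

# Update time: the step applied to the weight is scaled the same way.
grad = torch.randn(fan_out, fan_in)             # placeholder for a real gradient
lr = 0.1
weight = weight - scale_to_spectral_norm(grad, lr * target)
```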
Detailed Math
Alignment
- It's worth expanding a little on what we mean by alignment here. When we say that an input `x` aligns with a weight matrix `weight`, we mean that if we compute `U, S, V = torch.linalg.svd(weight)`, then the input `x` will tend to have a larger dot product with the rows of `V` that correspond to larger diagonal entries of the singular value matrix `S`. When we say that layers align, we mean that the outputs of one layer will align with the next layer.
    - This happens naturally during gradient descent. https://jeremybernste.in/modula/golden-rules/#three-golden-rules
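As a concrete way to see this, here is a small sketch (the helper name `alignment_profile` is mine, not from the source) that measures how strongly an input projects onto each right singular direction of a weight matrix; during training, the entries corresponding to large singular values tend to grow.

```python
import torch

def alignment_profile(weight: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """|<x, v_i>| for each right singular vector v_i, ordered by decreasing singular value."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    return (Vh @ x).abs()    # rows of Vh are the right singular vectors

weight = torch.randn(64, 256) / 256 ** 0.5
x = torch.randn(256)
print(alignment_profile(weight, x)[:5])   # roughly flat at init; top entries grow with training
```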
Spectral norm
- The spectral norm is the largest factor by which a matrix can increase the norm of a vector on which it acts.
- In the case of deep learning, the spectral norm of a weight matrix upper-bounds the scale of the activations it produces, since $\|W h\|_2 \le \|W\|_* \, \|h\|_2$.
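A quick numerical check of this bound, using `torch.linalg.matrix_norm(..., ord=2)` for the spectral norm (the shapes are arbitrary):

```python
import torch

W = torch.randn(128, 512)
h = torch.randn(512)

spec = torch.linalg.matrix_norm(W, ord=2)     # largest singular value of W
lhs = torch.linalg.vector_norm(W @ h)         # activation scale ||W h||
rhs = spec * torch.linalg.vector_norm(h)      # bound ||W||_* ||h||
print(f"||Wh|| = {lhs:.2f} <= ||W||_* ||h|| = {rhs:.2f}")
```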
Feature learning
- The feature learning regime can be summarized as: both the features and their updates upon a step of gradient descent must be the proper size.
- Let $h_\ell(x) \in \mathbb{R}^{n_\ell}$ denote the features of input $x$ at layer $\ell$ of a neural network (with $n_\ell$ the width of that layer), and let $\Delta h_\ell(x)$ denote their change after a gradient step. We desire that $\|h_\ell(x)\|_2 = \Theta(\sqrt{n_\ell})$ and $\|\Delta h_\ell(x)\|_2 = \Theta(\sqrt{n_\ell})$ at every layer.
    - This amounts to asking that the "typical element size" of the vectors $h_\ell$ and $\Delta h_\ell$ is $\Theta(1)$ with respect to the width $n_\ell$.
    - The requirement on $h_\ell$ is motivated by the fact that activation functions are designed to take order-one inputs and give order-one outputs (e.g. tanh).
    - The requirement on $\Delta h_\ell$ stipulates that feature entries also undergo order-one updates during training. Note that any larger updates would blow up at large width, and any smaller updates would vanish at large width.
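A tiny sketch of the "typical element size" in question, i.e. $\|h\|_2 / \sqrt{n_\ell}$, which the desideratum asks to stay $\Theta(1)$ as the width grows (the random vectors below are just stand-ins for features):

```python
import torch

def typical_element_size(h: torch.Tensor) -> float:
    """||h||_2 / sqrt(width): the per-entry scale that should be Θ(1) in width."""
    return (torch.linalg.vector_norm(h) / h.numel() ** 0.5).item()

for width in (256, 1024, 4096):
    h = torch.randn(width)                             # stand-in for features at some layer
    print(width, round(typical_element_size(h), 3))    # stays ≈ 1 at every width
```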
Condition 1 (Spectral scaling)
- The main message is that feature learning in the sense of the above definition may be ensured by the following spectral scaling condition on the weight matrices of a deep network and their gradient updates.
- Consider applying a gradient update $\Delta W_\ell$ to the $\ell$-th weight matrix $W_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ (fan-in $n_{\ell-1}$, fan-out $n_\ell$). The spectral norms of these matrices should satisfy $\|W_\ell\|_* = \Theta\!\left(\sqrt{n_\ell / n_{\ell-1}}\right)$ and $\|\Delta W_\ell\|_* = \Theta\!\left(\sqrt{n_\ell / n_{\ell-1}}\right)$, i.e. both should scale like $\sqrt{\texttt{fan\_out}/\texttt{fan\_in}}$.
- We have implicitly assumed that the input has size $\|x\|_2 = \Theta(\sqrt{n_0})$, which is standard for image data. Language models are an important counterexample, where embedding matrices take one-hot inputs (i.e. not all of the width acts on the first layer), and the $n_0$ in Condition 1 should be replaced by 1.
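For concreteness, here is the target spectral scale from Condition 1 for a few illustrative layer shapes (my own examples), including the language-model embedding case where the fan-in is replaced by 1:

```python
def target_spectral_norm(fan_in: int, fan_out: int) -> float:
    """Condition 1: ||W||_* and ||ΔW||_* should both scale like sqrt(fan_out / fan_in)."""
    return (fan_out / fan_in) ** 0.5

print(target_spectral_norm(fan_in=1024, fan_out=1024))   # square hidden layer: 1.0
print(target_spectral_norm(fan_in=1024, fan_out=10))     # readout layer: ~0.1
print(target_spectral_norm(fan_in=1, fan_out=1024))      # embedding with one-hot inputs: 32.0
```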
Parametrization 1 (Spectral parametrization) - Efficient implementation of the spectral scaling condition
- Spectral scaling induces feature learning.
- How to implement it?
    - They claim that the spectral scaling condition (Condition 1) is satisfied and feature learning is achieved (as per Desideratum 1) if the initialization scale and learning rate of each layer are chosen according to $\sigma_\ell = \Theta\!\left(\sqrt{\tfrac{n_\ell}{n_{\ell-1}}} \cdot \tfrac{1}{\sqrt{\max(n_{\ell-1},\, n_\ell)}}\right)$ and $\eta_\ell = \Theta\!\left(\tfrac{n_\ell}{n_{\ell-1}}\right)$ (for SGD), where $n_{\ell-1}$ is the fan-in and $n_\ell$ the fan-out of layer $\ell$.
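A minimal sketch of what this could look like for a plain SGD-trained MLP in PyTorch: per-layer init std and learning rate follow the formulas above (up to constant factors). The widths, base learning rate, and helper names (`spectral_init_std`, `spectral_lr`) are my own illustrative choices, not the paper's code.

```python
import torch
import torch.nn as nn

def spectral_init_std(fan_in: int, fan_out: int) -> float:
    # sigma_l ~ sqrt(fan_out / fan_in) / sqrt(max(fan_in, fan_out))
    return (fan_out / fan_in) ** 0.5 / max(fan_in, fan_out) ** 0.5

def spectral_lr(fan_in: int, fan_out: int, base_lr: float) -> float:
    # eta_l ~ fan_out / fan_in for SGD
    return base_lr * fan_out / fan_in

widths = [784, 1024, 1024, 10]   # input -> hidden -> hidden -> output
layers = [nn.Linear(fi, fo, bias=False) for fi, fo in zip(widths[:-1], widths[1:])]

param_groups = []
for layer in layers:
    fan_out, fan_in = layer.weight.shape          # nn.Linear stores weight as (out, in)
    nn.init.normal_(layer.weight, std=spectral_init_std(fan_in, fan_out))
    param_groups.append({"params": [layer.weight],
                         "lr": spectral_lr(fan_in, fan_out, base_lr=0.1)})

optimizer = torch.optim.SGD(param_groups, lr=0.1)  # per-layer lr overrides the default
```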
Random initialization
- Common practice: $W_\ell$ is initialized as $W_\ell = \sigma_\ell \Omega_\ell$, where all elements of $\Omega_\ell$ are initialized i.i.d. from a normal distribution with mean zero and unit variance. The spectral norm of a matrix thus constructed is roughly $\|W_\ell\|_* \approx \sigma_\ell \left(\sqrt{n_{\ell-1}} + \sqrt{n_\ell}\right)$.
    - To get the desired scaling $\|W_\ell\|_* = \Theta\!\left(\sqrt{n_\ell / n_{\ell-1}}\right)$, we need merely choose $\sigma_\ell = \sqrt{n_\ell / n_{\ell-1}} \,\big/\, \left(\sqrt{n_{\ell-1}} + \sqrt{n_\ell}\right)$.
    - Simplifying within the $\Theta(\cdot)$, we arrive at $\sigma_\ell$ scaled as in the spectral parametrization (Parametrization 1). Initializing weights with a prefactor scaling in this manner achieves the correct spectral norm of $\Theta\!\left(\sqrt{n_\ell / n_{\ell-1}}\right)$.
    - We note that the constant factor suppressed by the $\Theta(\cdot)$ here will usually be small: for example, a prefactor of $\sqrt{2}$ agrees with typical practice for ReLU networks at most layers.
- If $\Omega_\ell$ is instead a random semi-orthogonal matrix (spectral norm of 1), then we can simply use the prefactor $\sigma_\ell = \sqrt{n_\ell / n_{\ell-1}}$.
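A quick numerical check of both claims (shapes and $\sigma$ are arbitrary): the spectral norm of a Gaussian-initialized matrix concentrates around $\sigma\left(\sqrt{\texttt{fan\_in}} + \sqrt{\texttt{fan\_out}}\right)$, while a semi-orthogonal matrix has spectral norm exactly 1, so the prefactor alone sets $\|W\|_*$.

```python
import torch
import torch.nn as nn

fan_in, fan_out, sigma = 4096, 1024, 0.02

# Gaussian init: ||W||_* ≈ sigma * (sqrt(fan_in) + sqrt(fan_out)).
W = sigma * torch.randn(fan_out, fan_in)
print(torch.linalg.matrix_norm(W, ord=2).item(),           # ≈ 1.92
      sigma * (fan_in ** 0.5 + fan_out ** 0.5))             # = 1.92

# Semi-orthogonal init: prefactor sqrt(fan_out / fan_in) directly gives the target norm.
Q = nn.init.orthogonal_(torch.empty(fan_out, fan_in))
target = (fan_out / fan_in) ** 0.5
print(torch.linalg.matrix_norm(target * Q, ord=2).item(), target)   # both 0.5
```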
Biases
- Extend the spectral analysis to biases.
- Let $b_\ell \in \mathbb{R}^{n_\ell}$ be a bias vector which enters during forward propagation as $h_\ell = W_\ell h_{\ell-1} + b_\ell$. We may choose to view the bias vector as a weight matrix connecting an auxiliary layer with width 1 (and constant output 1) to the $\ell$-th hidden layer, after which we may simply apply our scaling analysis for weight matrices.
- With fan-in 1 and fan-out $n_\ell$, the spectral scaling condition (Condition 1) prescribes that $\|b_\ell\|_2 = \Theta(\sqrt{n_\ell})$ and $\|\Delta b_\ell\|_2 = \Theta(\sqrt{n_\ell})$, and Parametrization 1 prescribes that the initialization scale and learning rate should be $\sigma_\ell = \Theta(1)$ and $\eta_\ell = \Theta(n_\ell)$. In practice, one may usually just take $b_\ell = 0$ at initialization.
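Plugging fan-in 1 into the same formulas makes the bias prescription concrete (the width and base learning rate below are arbitrary illustrative values):

```python
n_l = 1024                                  # width of the l-th hidden layer
fan_in, fan_out = 1, n_l                    # bias viewed as an (n_l x 1) weight matrix

sigma_b = (fan_out / fan_in) ** 0.5 / max(fan_in, fan_out) ** 0.5   # = 1.0  -> Θ(1) init scale
eta_b = 0.1 * fan_out / fan_in                                      # = 102.4 -> Θ(n_l) SGD lr
print(sigma_b, eta_b)
```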
Comparison to standard parametrization (SP)
- "Kaiming," "Xavier," or "LeCun" initialization
- Notice that SP initialization exceeds the spectral parametrization in any layer with fan-out smaller than fan-in (e.g. the second MLP matrix in a GLU, in most cases).
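A concrete instance of this gap (illustrative shapes: a GLU/MLP down-projection with width $d$ and a 4x expansion): SP's $1/\sqrt{\texttt{fan\_in}}$ overshoots the spectral scale by a factor of $\sqrt{\texttt{fan\_in}/\texttt{fan\_out}}$.

```python
d = 1024
fan_in, fan_out = 4 * d, d                  # down-projection: fan_out < fan_in

sp_std = fan_in ** -0.5                                            # SP (Kaiming/LeCun-style)
spectral_std = (fan_out / fan_in) ** 0.5 / max(fan_in, fan_out) ** 0.5
print(sp_std, spectral_std, sp_std / spectral_std)                 # ratio = sqrt(fan_in/fan_out) = 2
```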