Summary
- Spectrum selectively trains a subset of a model's layers in full precision, chosen by their signal-to-noise ratio (SNR); see the sketch after this list.
- It skips (freezes) matrices whose singular values are insignificant, i.e. dominated by noise.
- This has several advantages:
- preservation of factual and scattered information from the pre-training phase, since layers holding diverse information are retained;
- training focuses on more stable, well-posed matrices, i.e. those with larger maximum and minimum singular values;
- targeting matrices with larger singular values emphasizes the transformations with the greatest impact on the latent representations.
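To make "selectively trains a subset" concrete, here is a minimal sketch of the freezing step, assuming a hypothetical `selected_layer_names` set produced by the SNR ranking described below (not Spectrum's actual interface):

import torch.nn as nn

def freeze_all_but(model: nn.Module, selected_layer_names: set) -> None:
    # train only parameters that belong to one of the selected modules; freeze the rest
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(sel) for sel in selected_layer_names)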
Mathematical Foundation
- For a weight matrix $W$, the SVD is given by $W = U \Sigma V^T$, where $U$ and $V$ are orthogonal, and $\Sigma$ is a diagonal matrix of singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$.
- The eigenvalues of $W^T W$ are the squared singular values: $\lambda_i = \sigma_i^2$.
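A quick numerical sanity check of this relationship (my own illustration, not from the paper):

import torch

W = torch.randn(64, 32)
squared_svals = torch.linalg.svdvals(W) ** 2      # sigma_i^2, in descending order
eigvals = torch.linalg.eigvalsh(W.T @ W).flip(0)  # eigenvalues of W^T W, flipped to descending
assert torch.allclose(squared_svals, eigvals, atol=1e-3)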
- Random-matrix theory (RMT) provides insights into the nature of the data represented by a matrix.
- For large matrices built from real-world data, most eigenvalues/singular values concentrate in a “bulk spectrum”. Values associated with less frequent data points often deviate from this bulk and can be misinterpreted as noise.
- RMT helps distinguish signal from noise, but identifying meaningful signals among the deviations requires careful consideration.
Marchenko-Pastur Distribution
- The Marchenko-Pastur distribution describes the eigenvalue distribution of large random matrices as their dimensions tend to infinity with a fixed aspect ratio.
- It is only applicable to square matrices, thus we use $W^T W$ (which is always square and whose eigenvalues are the squared singular values $\sigma_i^2$) to inform us about the singular values of $W$.
- For a matrix $W$ of size $n \times m$ whose entries have standard deviation $\sigma$, the eigenvalues of $\frac{1}{\max(n, m)} W^T W$ converge to a distribution bounded by:

$$\lambda_{\pm} = \sigma^2 \left(1 \pm \sqrt{\beta}\right)^2, \qquad \beta = \frac{\min(n, m)}{\max(n, m)}$$

where $\lambda_{+}$ and $\lambda_{-}$ are the largest and smallest eigenvalues. This leads to bounds on the singular values of $W$:

$$\epsilon_{\pm} = \sigma \left(1 \pm \sqrt{\beta}\right) \sqrt{\max(n, m)}$$
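As a sanity check (my own illustration, not from the paper), the singular values of a pure-noise matrix should fall almost entirely inside $[\epsilon_{-}, \epsilon_{+}]$:

import torch

n, m, sigma = 4096, 1024, 0.02
W = torch.randn(n, m) * sigma   # pure i.i.d. noise with standard deviation sigma
S = torch.linalg.svdvals(W)

beta = min(n, m) / max(n, m)
eps_plus = sigma * (1 + beta ** 0.5) * max(n, m) ** 0.5
eps_minus = sigma * (1 - beta ** 0.5) * max(n, m) ** 0.5

# allow ~1% slack for finite-size fluctuations at the spectrum edges
print(bool(S.max() <= eps_plus * 1.01), bool(S.min() >= eps_minus * 0.99))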
Signal-to-Noise Ratio and Matrix Ranking
Math
- To ensure numerical stability and efficient computation, they omit the $\sqrt{\max(n, m)}$ normalization term when calculating the singular value bounds.
- They also estimate $\sigma$ with the interquartile range (IQR / 1.349, the value that matches the standard deviation of a normal distribution) instead of the sample standard deviation, to account for potential skewness and kurtosis.
- The signal-to-noise ratio (SNR) of a weight matrix is defined as the ratio of the sum of signal singular values to the sum of noise singular values:

$$\text{SNR} = \frac{\sum_{\sigma_i > \epsilon} \sigma_i}{\sum_{\sigma_i \le \epsilon} \sigma_i}$$

where the threshold $\epsilon$ (the $\epsilon_{+}$ bound above, without the normalization term) separates signal from noise singular values; a small worked example follows this list.
- They normalize the SNR by the largest singular value $\sigma_{\max}$ for sensitivity analysis, easier comparison across layers, and to incorporate conditioning information.
- Matrices with higher SNR contain more informative features and less noise, making them ideal targets for efficient learning and improved model performance.
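- A quick worked example (my own numbers, not from the paper): if a matrix has singular values $\{10, 5, 0.4, 0.2\}$ and the threshold is $\epsilon = 1$, then signal $= 10 + 5 = 15$ and noise $= 0.4 + 0.2 = 0.6$, so $\text{SNR} = 15 / 0.6 = 25$; normalizing by $\sigma_{\max} = 10$ gives a final score of $2.5$.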
Code
## imports needed by the snippets below
import numpy as np
import torch

## computing the inter-quartile range as a robust estimate of sigma
def estimate_sigma_with_full_iqr(S):
    q75 = torch.quantile(S, 0.75)
    q25 = torch.quantile(S, 0.25)
    iqr = q75 - q25
    sigma_estimated = iqr / 1.349  # IQR / 1.349 equals the std of a normal distribution
    return sigma_estimated

## computing epsilon_plus (the sqrt(max(n, m)) normalization term is omitted, as noted above)
def marchenko_pastur_threshold(sigma, n, m):
    beta = n / m if n < m else m / n  # aspect ratio beta = min(n, m) / max(n, m)
    threshold = sigma * (1 + np.sqrt(beta))  # sqrt((1 + sqrt(beta))**2) simplified
    return threshold
## per-layer SNR computation; `weights` is one layer's weight matrix
S = torch.linalg.svdvals(weights)  # singular values, sorted in descending order
max_singular_value = S[0]
n, m = weights.shape[-2:]
sigma_estimated = estimate_sigma_with_full_iqr(S)
mp_threshold = marchenko_pastur_threshold(sigma_estimated, n, m)
## filtering: singular values above the MP threshold are signal, the rest noise
signal_mask = S > mp_threshold
noise_mask = ~signal_mask
signal = S[signal_mask].sum() if signal_mask.any() else torch.tensor(0.0, device=S.device)
# fall back to 1.0 so the division stays finite when nothing is classified as noise
noise = S[noise_mask].sum() if noise_mask.any() else torch.tensor(1.0, device=S.device)
snr = signal / noise if noise != 0 else float('inf')
snr_ratio = snr / max_singular_value  # normalize by the largest singular value
self.layer_snr[name] = {'type': layer_type, 'snr': snr_ratio.item()}  # stored per layer in the scanner class
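Putting it together, a hypothetical end-to-end sketch; the `compute_layer_snr` wrapper, the generic `model`, and the top-25% cutoff are my assumptions, not Spectrum's exact interface:

import torch

def compute_layer_snr(weights: torch.Tensor) -> float:
    # MP-threshold SNR of one weight matrix, normalized by its largest singular value
    S = torch.linalg.svdvals(weights.float())
    sigma = estimate_sigma_with_full_iqr(S)
    thr = marchenko_pastur_threshold(sigma, *weights.shape[-2:])
    signal, noise = S[S > thr].sum(), S[S <= thr].sum()
    snr = signal / noise if noise > 0 else float('inf')
    return (snr / S[0]).item()

## rank all 2-D weight matrices and keep the top quarter for full-precision training
scores = {name: compute_layer_snr(p.detach()) for name, p in model.named_parameters() if p.ndim == 2}
selected = sorted(scores, key=scores.get, reverse=True)[: max(1, len(scores) // 4)]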