Summary
- Spectrum selectively trains a subset of a model's layers in full precision, chosen by their signal-to-noise ratio (SNR); see the sketch after this list.
- It skips (freezes) matrices whose singular values are insignificant, i.e. dominated by noise.
- This has several advantages:
- preservation of factual and scattered information from the pre-training phase, since layers holding diverse information are retained;
- training focuses on more stable, well-posed matrices, i.e. those with larger maximum and minimum singular values;
- targeting matrices with larger singular values emphasizes the transformations with the greatest impact on the latent representations.
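To make "selectively trains a subset" concrete, here is a minimal sketch of the freezing step, assuming a hypothetical `selected_layer_names` set produced by the SNR ranking described below (not Spectrum's actual interface):

import torch.nn as nn

def freeze_all_but(model: nn.Module, selected_layer_names: set) -> None:
    # train only parameters that belong to one of the selected modules; freeze the rest
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(sel) for sel in selected_layer_names)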
Mathematical Foundation
- For a weight matrix $W$, the SVD is given by $W = U \Sigma V^T$, where $U$ and $V$ are orthogonal, and $\Sigma$ is a diagonal matrix of singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$.
- The eigenvalues of $W^T W$ are the squared singular values: $\lambda_i = \sigma_i^2$.
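A quick numerical sanity check of this relationship (my own illustration, not from the paper):

import torch

W = torch.randn(64, 32)
squared_svals = torch.linalg.svdvals(W) ** 2      # sigma_i^2, in descending order
eigvals = torch.linalg.eigvalsh(W.T @ W).flip(0)  # eigenvalues of W^T W, flipped to descending
assert torch.allclose(squared_svals, eigvals, atol=1e-3)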
- Random-matrix theory (RMT) provides insights into the nature of the data represented by a matrix.
- For large matrices built from real-world data, most eigenvalues/singular values concentrate in a “bulk spectrum”. Values associated with less frequent data points often deviate from this bulk and can be misinterpreted as noise.
- RMT helps distinguish signal from noise, but identifying meaningful signals among the deviations requires careful consideration.
Marchenko-Pastur Distribution
- The Marchenko-Pastur distribution describes the eigenvalue distribution of large random matrices as their dimensions tend to infinity with a fixed aspect ratio.
- It is only applicable to square matrices, thus we use $W^T W$ (which is always square and whose eigenvalues are the squared singular values $\sigma_i^2$) to inform us about the singular values of $W$.
- For a matrix $W$ of size $n \times m$ whose entries have standard deviation $\sigma$, the eigenvalues of $\frac{1}{\max(n, m)} W^T W$ converge to a distribution bounded by:

$$\lambda_{\pm} = \sigma^2 \left(1 \pm \sqrt{\beta}\right)^2, \qquad \beta = \frac{\min(n, m)}{\max(n, m)}$$

where $\lambda_{+}$ and $\lambda_{-}$ are the largest and smallest eigenvalues. This leads to bounds on the singular values of $W$:

$$\epsilon_{\pm} = \sigma \left(1 \pm \sqrt{\beta}\right) \sqrt{\max(n, m)}$$
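As a sanity check (my own illustration, not from the paper), the singular values of a pure-noise matrix should fall almost entirely inside $[\epsilon_{-}, \epsilon_{+}]$:

import torch

n, m, sigma = 4096, 1024, 0.02
W = torch.randn(n, m) * sigma   # pure i.i.d. noise with standard deviation sigma
S = torch.linalg.svdvals(W)

beta = min(n, m) / max(n, m)
eps_plus = sigma * (1 + beta ** 0.5) * max(n, m) ** 0.5
eps_minus = sigma * (1 - beta ** 0.5) * max(n, m) ** 0.5

# allow ~1% slack for finite-size fluctuations at the spectrum edges
print(bool(S.max() <= eps_plus * 1.01), bool(S.min() >= eps_minus * 0.99))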
Signal-to-Noise Ratio and Matrix Ranking
Math
- To ensure numerical stability and efficient computation, they omit the $\sqrt{\max(n, m)}$ normalization term when calculating the singular value bounds.
- They also estimate $\sigma$ with the interquartile range (IQR / 1.349, the value that matches the standard deviation of a normal distribution) instead of the sample standard deviation, to account for potential skewness and kurtosis.
- The signal-to-noise ratio (SNR) of a weight matrix is defined as the ratio of the sum of signal singular values to the sum of noise singular values:

$$\text{SNR} = \frac{\sum_{\sigma_i > \epsilon} \sigma_i}{\sum_{\sigma_i \le \epsilon} \sigma_i}$$

where the threshold $\epsilon$ (the $\epsilon_{+}$ bound above, without the normalization term) separates signal from noise singular values; a small worked example follows this list.
- They normalize the SNR by the largest singular value $\sigma_{\max}$ for sensitivity analysis, easier comparison across layers, and to incorporate conditioning information.
- Matrices with higher SNR contain more informative features and less noise, making them ideal targets for efficient learning and improved model performance.
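- A quick worked example (my own numbers, not from the paper): if a matrix has singular values $\{10, 5, 0.4, 0.2\}$ and the threshold is $\epsilon = 1$, then signal $= 10 + 5 = 15$ and noise $= 0.4 + 0.2 = 0.6$, so $\text{SNR} = 15 / 0.6 = 25$; normalizing by $\sigma_{\max} = 10$ gives a final score of $2.5$.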
Code
## imports needed by the snippets below
import numpy as np
import torch

## computing the inter-quartile range as a robust estimate of sigma
def estimate_sigma_with_full_iqr(S):
    q75 = torch.quantile(S, 0.75)
    q25 = torch.quantile(S, 0.25)
    iqr = q75 - q25
    sigma_estimated = iqr / 1.349  # IQR / 1.349 equals the std of a normal distribution
    return sigma_estimated

## computing epsilon_plus (the sqrt(max(n, m)) normalization term is omitted, as noted above)
def marchenko_pastur_threshold(sigma, n, m):
    beta = n / m if n < m else m / n  # aspect ratio beta = min(n, m) / max(n, m)
    threshold = sigma * (1 + np.sqrt(beta))  # sqrt((1 + sqrt(beta))**2) simplified
    return threshold
## per-layer SNR computation; `weights` is one layer's weight matrix
S = torch.linalg.svdvals(weights)  # singular values, sorted in descending order
max_singular_value = S[0]
n, m = weights.shape[-2:]
sigma_estimated = estimate_sigma_with_full_iqr(S)
mp_threshold = marchenko_pastur_threshold(sigma_estimated, n, m)
## filtering: singular values above the MP threshold are signal, the rest noise
signal_mask = S > mp_threshold
noise_mask = ~signal_mask
signal = S[signal_mask].sum() if signal_mask.any() else torch.tensor(0.0, device=S.device)
# fall back to 1.0 so the division stays finite when nothing is classified as noise
noise = S[noise_mask].sum() if noise_mask.any() else torch.tensor(1.0, device=S.device)
snr = signal / noise if noise != 0 else float('inf')
snr_ratio = snr / max_singular_value  # normalize by the largest singular value
self.layer_snr[name] = {'type': layer_type, 'snr': snr_ratio.item()}  # stored per layer in the scanner class
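Putting it together, a hypothetical end-to-end sketch; the `compute_layer_snr` wrapper, the generic `model`, and the top-25% cutoff are my assumptions, not Spectrum's exact interface:

import torch

def compute_layer_snr(weights: torch.Tensor) -> float:
    # MP-threshold SNR of one weight matrix, normalized by its largest singular value
    S = torch.linalg.svdvals(weights.float())
    sigma = estimate_sigma_with_full_iqr(S)
    thr = marchenko_pastur_threshold(sigma, *weights.shape[-2:])
    signal, noise = S[S > thr].sum(), S[S <= thr].sum()
    snr = signal / noise if noise > 0 else float('inf')
    return (snr / S[0]).item()

## rank all 2-D weight matrices and keep the top quarter for full-precision training
scores = {name: compute_layer_snr(p.detach()) for name, p in model.named_parameters() if p.ndim == 2}
selected = sorted(scores, key=scores.get, reverse=True)[: max(1, len(scores) // 4)]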