Summary

  • Spectrum selectively trains a subset of layers in full precision based on their signal-to-noise ratio (SNR).
  • It skips matrices with insignificant singular values. This has several advantages:
    • preservation of factual and scattered information from the pre-training phase, since layers with diverse information are retained;
    • training focuses on more stable and well-posed matrices with larger maximum and minimum singular values;
    • targeting matrices with larger singular values emphasizes the transformations with the largest impact on the latent representation.

Mathematical Foundation

  • For a weight matrix $W$, the SVD is given by $W = U \Sigma V^\top$, where $U$ and $V$ are orthogonal and $\Sigma$ is a diagonal matrix of singular values $\sigma_1 \geq \sigma_2 \geq \dots \geq 0$.

    • The eigenvalues of $W W^\top$ are the squared singular values, $\lambda_i = \sigma_i^2$ (verified numerically in the sketch after this list);
  • Random-matrix theory (RMT) provides insights into the nature of data represented by a matrix.

    • For large matrices built from real-world data, most eigenvalues/singular values typically form a “bulk spectrum”. Values associated with less frequent data patterns often deviate from this bulk and can be misinterpreted as noise.
    • RMT helps distinguish signal from noise, but identifying meaningful signals among these deviations requires careful consideration.
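
To make the eigenvalue/singular-value relation concrete, here is a minimal sketch (my own check, not from the Spectrum codebase) confirming numerically that the eigenvalues of $W W^\top$ match the squared singular values of $W$:

import torch

## random stand-in for a weight matrix
W = torch.randn(64, 128, dtype=torch.float64)

squared_svals = torch.linalg.svdvals(W) ** 2      ## sigma_i^2, descending order
eigvals = torch.linalg.eigvalsh(W @ W.T).flip(0)  ## lambda_i, ascending -> flipped

print(torch.allclose(squared_svals, eigvals))     ## expected: True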

Marchenko-Pastur distribution

  • The Marchenko-Pastur distribution describes the eigenvalue distribution of large random matrices as dimensions tend to infinity with a fixed aspect ratio.

    • it is only applicable to square matrices, so we use $W W^\top$ (which is always square and whose eigenvalues are the squared singular values $\sigma_i^2$) to reason about the singular values of $W$
  • For a matrix $W$ of size $n \times m$ with aspect ratio $\beta = \min(n, m) / \max(n, m)$, the eigenvalues of $\frac{1}{\max(n, m)} W W^\top$ converge to a distribution bounded by:

    $$\lambda_{\pm} = \sigma^2 \left(1 \pm \sqrt{\beta}\right)^2$$

  • where $\lambda_+$ and $\lambda_-$ are the largest and smallest bulk eigenvalues, and $\sigma$ is the standard deviation of the matrix entries. This leads to bounds on the singular values of $W$:

    $$\epsilon_{\pm} = \sigma \sqrt{\max(n, m)} \left(1 \pm \sqrt{\beta}\right)$$
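
As a quick worked example (numbers chosen purely for illustration): for a weight matrix with $n = 1024$, $m = 4096$, and estimated $\sigma = 0.02$, the aspect ratio is $\beta = 1024 / 4096 = 0.25$, so $\epsilon_+ = 0.02 \cdot \sqrt{4096} \cdot (1 + \sqrt{0.25}) = 0.02 \cdot 64 \cdot 1.5 = 1.92$. Spectrum's implementation drops the $\sqrt{\max(n, m)}$ factor (see below), so the threshold actually used would be $0.02 \cdot 1.5 = 0.03$: singular values above it are treated as signal, the rest as noise.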

Signal-to-Noise Ratio and Matrix Ranking

Math

  • To ensure numerical stability and efficient computation, they omit the normalization term $\sqrt{\max(n, m)}$ when calculating the singular value bounds.
  • They also use the interquartile range (IQR) instead of the standard deviation to account for potential skewness and kurtosis; for a normal distribution, $\text{IQR} = 1.349\,\sigma$, which is where the constant in the code below comes from.
  • The signal-to-noise ratio (SNR) of a weight matrix is defined as:

    $$\text{SNR} = \frac{\sum_{\sigma_i > \epsilon} \sigma_i}{\sum_{\sigma_i \leq \epsilon} \sigma_i}$$

  • where the threshold $\epsilon$ (the Marchenko-Pastur bound $\epsilon_+$) separates signal from noise singular values.
  • They normalize the SNR by the largest singular value $\sigma_{\max}$ for sensitivity analysis, comparison across layers, and conditioning information.
  • Matrices with higher SNR contain more informative features and less noise, making them ideal targets for efficient learning and improved model performance.

Code

import numpy as np
import torch

## estimating sigma via the inter-quartile range of the singular values
## (for a normal distribution, IQR = 1.349 * sigma)
def estimate_sigma_with_full_iqr(S):
    q75 = torch.quantile(S, 0.75)
    q25 = torch.quantile(S, 0.25)
    iqr = q75 - q25
    sigma_estimated = iqr / 1.349
    return sigma_estimated
 
## computing epsilon_plus; the sqrt(max(n, m)) normalization term is omitted
def marchenko_pastur_threshold(sigma, n, m):
    beta = n / m if n < m else m / n  ## aspect ratio min(n, m) / max(n, m)
    threshold = sigma * (1 + np.sqrt(beta))
    return threshold
 
## `weights` is one weight matrix of the model; `name` and `layer_type`
## identify it in the loop over layers (context omitted here)
S = torch.linalg.svdvals(weights)  ## singular values, descending
max_singular_value = S[0]
n, m = weights.shape[-2:]
sigma_estimated = estimate_sigma_with_full_iqr(S)
mp_threshold = marchenko_pastur_threshold(sigma_estimated, n, m)
 
## filtering: singular values above the threshold count as signal, the rest as noise
signal_mask = S > mp_threshold
noise_mask = ~signal_mask
 
signal = S[signal_mask].sum() if signal_mask.any() else torch.tensor(0.0, device=S.device)
noise = S[noise_mask].sum() if noise_mask.any() else torch.tensor(1.0, device=S.device)
                        
snr = signal / noise if noise != 0 else float('inf')
snr_ratio = snr / max_singular_value  ## normalize by the largest singular value
self.layer_snr[name] = {'type': layer_type, 'snr': snr_ratio.item()}
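
Once every layer's SNR has been computed, Spectrum unfreezes only the highest-SNR layers of each type for training. Below is a minimal sketch of how the collected layer_snr dictionary could drive that selection; the select_layers_to_train helper and the top_fraction parameter are illustrative assumptions, not the repo's exact API:

from collections import defaultdict

## hypothetical helper: keep the highest-SNR fraction of each layer type
def select_layers_to_train(layer_snr, top_fraction=0.25):
    by_type = defaultdict(list)
    for name, info in layer_snr.items():
        by_type[info['type']].append((name, info['snr']))

    selected = []
    for layer_type, layers in by_type.items():
        layers.sort(key=lambda x: x[1], reverse=True)  ## highest SNR first
        k = max(1, int(len(layers) * top_fraction))
        selected.extend(name for name, _ in layers[:k])
    return selected

## e.g. freeze everything, then unfreeze only the selected layers:
## for name, param in model.named_parameters():
##     param.requires_grad = any(sel in name for sel in selected)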