Orthogonal Matrices and Quantization
Sources:
- SliceGPT: Compress Large Language Models by Deleting Rows and Columns
- QuIP: 2-Bit Quantization of Large Language Models With Guarantees
- o3
TLDR: rotating weights and activations by an orthogonal matrix reduces outliers, which shrinks the dynamic range the quantiser must cover.
Why we care about orthogonal matrices
Let $Q \in \mathbb{R}^{d \times d}$ denote an orthogonal matrix: $Q^\top Q = Q Q^\top = I$. Note that multiplying a vector by $Q$ does not change the norm of the vector, since
$$\|Qx\|_2^2 = x^\top Q^\top Q\,x = x^\top x = \|x\|_2^2.$$
In this work, the dimension of $Q$ will always match the embedding dimension $d$ of the transformer.
1. Setup
We work with a single linear layer
$$y = W x, \qquad W \in \mathbb{R}^{m \times d},\; x \in \mathbb{R}^{d},$$
and a uniform symmetric $b$-bit scalar quantiser
$$Q_b(t) = \Delta \cdot \mathrm{round}\!\left(\frac{t}{\Delta}\right), \qquad \Delta = \frac{2\alpha}{2^b - 1},$$
where the clipping bound $\alpha$ is normally chosen as the largest absolute entry of the tensor that is being quantised, and $\Delta$ defines the quantization cell width.
We rotate both weights and activations once, with a fixed orthogonal $Q$ ($Q^\top Q = Q Q^\top = I$):
$$\widetilde W = W\,Q, \qquad \widetilde x = Q^\top x.$$
Because $Q\,Q^\top = I$,
$$\widetilde W\,\widetilde x = W\,Q\,Q^\top x = W x = y,$$
so the network is mathematically unchanged before quantisation.
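A minimal numpy sketch of this setup: the quantiser $Q_b$ and a check of the exact rotation invariance. The `quantize` helper, the shapes, and the random $Q$ obtained via QR are my own illustration, not code from the papers.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 256, 256

def quantize(t, bits=4):
    """Uniform symmetric quantizer: clip at alpha = max|t|, step = 2*alpha/(2^b - 1)."""
    alpha = np.abs(t).max()
    delta = 2 * alpha / (2**bits - 1)
    return delta * np.round(t / delta)

W = rng.standard_normal((m, d))
x = rng.standard_normal(d)

# A random orthogonal Q from the QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Exact invariance before quantization: (W Q)(Q^T x) == W x.
print(np.allclose((W @ Q) @ (Q.T @ x), W @ x))   # True (up to float error)
```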
2. Quantisation noise before and after the rotation
Write the quantisation error of a tensor $T$ as
$$\varepsilon(T) = Q_b(T) - T.$$
In general, the element-wise quantization MSE of the tensor $T$ is equal to $\Delta^2/12$.
A small proof: consider one quantization cell of width $\Delta$; the error on that cell is the difference between the original sample and the reconstruction point (the cell's midpoint). We thus assume the error follows $\varepsilon \sim \mathcal{U}(-\Delta/2, \Delta/2)$. Then $\mathbb{E}[\varepsilon^2] = \mathrm{Var}(\varepsilon) = \Delta^2/12$, following the variance of a uniform variable.
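A quick numerical sanity check of the $\Delta^2/12$ formula (a hypothetical example with uniform samples, so every cell is evenly populated):

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, size=1_000_000)

bits = 4
alpha = np.abs(t).max()
delta = 2 * alpha / (2**bits - 1)
t_hat = delta * np.round(t / delta)

# Empirical MSE vs. the Delta^2 / 12 prediction; the two agree closely.
print(np.mean((t_hat - t) ** 2), delta**2 / 12)
```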
We focus on $\hat y = Q_b(W)\,Q_b(x)$ (and its rotated counterpart $Q_b(\widetilde W)\,Q_b(\widetilde x)$), the output after having quantized both weights and activations.
Keeping only first-order terms (the usual high-resolution approximation):
$$\hat y = (W + \varepsilon_W)(x + \varepsilon_x) \;\approx\; W x + \varepsilon_W\,x + W\,\varepsilon_x.$$
- Without rotation: the error is $\varepsilon_W\,x + W\,\varepsilon_x$, with per-element noise variances $\Delta_W^2/12$ and $\Delta_x^2/12$, where $\Delta_W$ and $\Delta_x$ are set by the largest entries of $W$ and $x$.
- With rotation: the same expression holds for $\widetilde W = WQ$ and $\widetilde x = Q^\top x$, with step sizes $\widetilde\Delta_W$ and $\widetilde\Delta_x$ set by the largest entries of $\widetilde W$ and $\widetilde x$.
Hence the output MSE (the cross term vanishes in expectation for independent zero-mean errors):
$$\mathbb{E}\,\|\hat y - y\|_2^2 \;\approx\; m\,\frac{\Delta_W^2}{12}\,\|x\|_2^2 \;+\; \frac{\Delta_x^2}{12}\,\|W\|_F^2,$$
and identically with tildes after the rotation. Since $\|\widetilde x\|_2 = \|x\|_2$ and $\|\widetilde W\|_F = \|W\|_F$, the rotation will help exactly because it reduces the largest absolute entries of $\widetilde W$ and $\widetilde x$; that lets us pick a smaller clipping bound $\alpha$, hence a smaller step $\Delta$, hence smaller noise. This is what we will derive next.
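Before deriving that, a quick empirical sketch of the effect: quantise a synthetic outlier-heavy $W$ and $x$ with and without a shared random rotation and compare the output error. The outlier model and bit-width here are my own illustrative choices, not taken from the papers.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, bits = 512, 512, 4

def quantize(t, bits):
    alpha = np.abs(t).max()                  # clipping bound = largest entry
    delta = 2 * alpha / (2**bits - 1)        # quantization cell width
    return delta * np.round(t / delta)

# Mostly small entries plus a few large outlier channels,
# loosely mimicking LLM activations/weights.
W = rng.standard_normal((m, d))
W[:, rng.choice(d, 8, replace=False)] *= 20.0
x = rng.standard_normal(d)
x[rng.choice(d, 8, replace=False)] *= 20.0
y = W @ x

# Without rotation.
err_plain = np.mean((quantize(W, bits) @ quantize(x, bits) - y) ** 2)

# With a shared random orthogonal rotation (exact before quantization).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
err_rot = np.mean((quantize(W @ Q, bits) @ quantize(Q.T @ x, bits) - y) ** 2)

print(err_plain, err_rot)                    # rotated error is much smaller
```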
3. Why an orthogonal rotation shrinks the required dynamic range
Consider a single length-$d$ vector $v$ (a row of $W$ or a column of the activation matrix $X$).
The quantiser step is $\Delta = \dfrac{2\,\|v\|_\infty}{2^b - 1}$.
Let $z = Q^\top v$ (a mere change of basis).
Because $Q$ is orthogonal, $\|z\|_2 = \|v\|_2$.
Yet its max-entry $\|z\|_\infty$ is, with high probability, vastly smaller because of the following lemma:
Lemma ("cube-to-sphere" effect)
Let $v \in \mathbb{R}^d$ be any non-zero vector and let $Q$ be drawn uniformly at random from the orthogonal group (Haar measure). Write
$$z = Q^\top v,$$
so $z$ is a random point on the sphere of radius $\|v\|_2$. Then, with high probability (thanks to Proposition 1.1 from Luh and O'Rourke),
$$\|z\|_\infty \;\le\; C\,\sqrt{\frac{\log d}{d}}\;\|v\|_2,$$
and the expectation of the maximum coordinate satisfies
$$\mathbb{E}\,\|z\|_\infty \;\asymp\; \sqrt{\frac{2\log d}{d}}\;\|v\|_2.$$
(The constant $C$ is universal and usually quoted between 2 and 3 in the literature.)
Why is it true?
- Gaussian picture: a random rotation turns the fixed vector $v$ into a direction whose coordinates behave as if they were i.i.d. $\mathcal{N}\!\left(0, \|v\|_2^2/d\right)$.
  Formal step: if $g_1, \dots, g_d$ are i.i.d. $\mathcal{N}(0,1)$, then $z \overset{d}{=} \|v\|_2\, g/\|g\|_2$, because a normalised Gaussian vector is uniform on the sphere. For large $d$ we have $\|g\|_2 \approx \sqrt{d}$ with high probability, so each coordinate $z_i \approx \|v\|_2\, g_i/\sqrt{d}$.
- Order statistics of Gaussians: for $d$ independent $\mathcal{N}(0, \sigma^2)$ variables, the expected maximum absolute value is bounded by $\sigma\sqrt{2\log(2d)}$, and is $\approx \sigma\sqrt{2\log d}$ for large $d$.
- Put together: substitute $\sigma = \|v\|_2/\sqrt{d}$:
  $$\mathbb{E}\,\|z\|_\infty \;\approx\; \|v\|_2\,\sqrt{\frac{2\log d}{d}}.$$
Consequence:
The clipping bound drops from $\|v\|_\infty$ (which for an outlier-heavy vector can be close to $\|v\|_2$) to roughly $\|v\|_2\sqrt{2\log d/d}$, so the step size $\Delta$ shrinks by the same ratio. Because the element-wise quantisation MSE is $\Delta^2/12$, the overall error falls by roughly the square of that factor.
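A quick numerical illustration of the lemma, with a maximally spiky vector and a random orthogonal $Q$ built by QR (illustrative dimensions, my own example):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024

# A maximally spiky unit vector: all of its energy in one coordinate.
v = np.zeros(d)
v[0] = 1.0

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
z = Q.T @ v

print(np.linalg.norm(v), np.linalg.norm(z))   # norms are equal
print(np.abs(v).max())                        # 1.0
print(np.abs(z).max())                        # on the order of sqrt(2 log d / d)
print(np.sqrt(2 * np.log(d) / d))             # ~0.116 for d = 1024
```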
Even a structured fast transform (Hadamard, DCT, butterfly, Householder chains, Givens cascades, …) is "sufficiently random" for typical correlated neural activations and weights: it still spreads energy almost uniformly, so the dynamic-range argument largely holds in practice.
4. Putting it all together
- Rotation preserves the function value (exact linear-algebra identity).
- The max-entry shrinks (cube-to-sphere), so the clipping range $\alpha$ and step size $\Delta$ shrink.
- The quantisation noise power $\Delta^2/12$ therefore falls.
- Because we rotate both activations and weights, the first-order output error is reduced in both terms, $\varepsilon_W\,x$ and $W\,\varepsilon_x$.
- Fast structured orthogonals (e.g. Hadamard matrices) give most of the benefit with $O(d\log d)$ or even $O(d)$ FLOPs instead of the $O(d^2)$ cost of a dense rotation.
Hence: applying one properly chosen (but inexpensive) orthogonal matrix before a single quantisation pass makes the tensors look "rounder", dramatically lowering the dynamic range that the scalar quantiser must cover, and thereby reducing the quantisation error, all while leaving the original network computation untouched.
How to use orthogonal matrices in a classic residual NN (purely offline)
- The computational invariance theorem in SliceGPT states that the weights and between-block activations of an RMSNorm transformer can be transformed using an orthogonal matrix with no change to the model output.
- Here is the notation, following SliceGPT (note that this is post-norm): activations are row vectors, linear layers act by right-multiplication ($x \mapsto xW$), and each block computes $x_{\ell+1} = \mathrm{RMSNorm}\!\big(x_\ell + f(x_\ell W_{\mathrm{in}})\,W_{\mathrm{out}}\big)$, where $f$ is the block's core (attention or MLP nonlinearity), $W_{\mathrm{in}}$ its input featurizer and $W_{\mathrm{out}}$ its output featurizer.
Here is the proof for applying orthogonal matrices in a function-preserving way:
- First fact: RMSNorm commutes with orthogonal matrices,
  $$\mathrm{RMSNorm}(xQ) = \mathrm{RMSNorm}(x)\,Q,$$
  because (scale-free) RMSNorm only divides each row by its norm, and $\|xQ\|_2 = \|x\|_2$.
- Second fact: for a standard post-norm residual block, the following network is equivalent to the one described above:
  - Assume the input embedding is right-multiplied, i.e. $W_{\mathrm{emb}} \to W_{\mathrm{emb}}\,Q$.
  - We will assume the following invariant:
    - for each layer, the input is equal to the original layer input right-multiplied by $Q$, i.e. $\tilde x_\ell = x_\ell\,Q$.
  - Then we apply the following transforms to each layer:
    - $W_{\mathrm{in}} \to Q^\top W_{\mathrm{in}}, \qquad W_{\mathrm{out}} \to W_{\mathrm{out}}\,Q.$
    - Note that the transformed block computes
      $$\mathrm{RMSNorm}\!\big(x_\ell Q + f(x_\ell Q\,Q^\top W_{\mathrm{in}})\,W_{\mathrm{out}}\,Q\big),$$
    - which is equal to
      $$\mathrm{RMSNorm}\!\big((x_\ell + f(x_\ell W_{\mathrm{in}})\,W_{\mathrm{out}})\,Q\big) = \mathrm{RMSNorm}\!\big(x_\ell + f(x_\ell W_{\mathrm{in}})\,W_{\mathrm{out}}\big)\,Q = x_{\ell+1}\,Q,$$
    - this is the expected output right-multiplied by $Q$, so the invariant holds at the next layer.
  - Finally, we "reverse" the right-transform by left-multiplying the LM head:
    $$W_{\mathrm{head}} \to Q^\top W_{\mathrm{head}}.$$
The general rules can be summarized as follows (a toy sketch follows the list):
- Use RMSNorm.
- The embedding matrix should be right-multiplied ($W_{\mathrm{emb}} \to W_{\mathrm{emb}}\,Q$).
- The input featurizer weights should be left-multiplied ($W \to Q^\top W$; e.g. QKV projections, gate and up projections in SwiGLU).
- We can absorb the RMSNorm diagonal scaling parameters into the input featurizer weights beforehand.
- The output featurizer weights should be right-multiplied ($W \to W\,Q$; usually just a linear layer).
- The LM head matrix should be left-multiplied ($W_{\mathrm{head}} \to Q^\top W_{\mathrm{head}}$).
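A minimal sketch of these rules on a toy post-norm residual block with a scale-free RMSNorm, checking that the rotated model produces identical outputs. The layer names and the tiny ReLU MLP are my own simplification, not SliceGPT's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff, vocab = 64, 128, 100

def rmsnorm(x):
    # Scale-free RMSNorm; the learnable diagonal is assumed absorbed into W_in.
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True))

def forward(tokens, W_emb, W_in, W_out, W_head):
    x = W_emb[tokens]                          # embedding lookup (rows of W_emb)
    h = np.maximum(x @ W_in, 0.0) @ W_out      # a tiny MLP "block"
    x = rmsnorm(x + h)                         # post-norm residual
    return x @ W_head                          # LM head logits

W_emb  = rng.standard_normal((vocab, d))
W_in   = rng.standard_normal((d, d_ff))
W_out  = rng.standard_normal((d_ff, d))
W_head = rng.standard_normal((d, vocab))
tokens = rng.integers(0, vocab, size=8)

# Offline rotation following the rules above.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
logits     = forward(tokens, W_emb, W_in, W_out, W_head)
logits_rot = forward(tokens, W_emb @ Q, Q.T @ W_in, W_out @ Q, Q.T @ W_head)

print(np.allclose(logits, logits_rot))         # True (up to float error)
```

In a real model the same $Q$ is applied to every block, and quantization then operates on the rotated weights and activations.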
Why we care about rotation/Hadamard matrices
A Hadamard matrix $H$ is a square matrix with entries drawn from $\{+1, -1\}$ whose rows are mutually orthogonal, i.e. $H H^\top = n I$ (so $H/\sqrt{n}$ is an orthogonal matrix). A Walsh-Hadamard matrix is a Hadamard matrix of size $n = 2^k$, built recursively as
$$H_1 = \begin{pmatrix} 1 \end{pmatrix}, \qquad H_{2n} = \begin{pmatrix} H_n & H_n \\ H_n & -H_n \end{pmatrix}.$$
These identities give rise to the Walsh-Hadamard transform, which computes the matrix-vector product $H_n v$ in $O(n\log n)$ operations.
We care because it's fast to apply, basically :)
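A minimal sketch of the fast Walsh-Hadamard transform in numpy (unnormalised; divide by $\sqrt{n}$ to make it orthogonal), checked against the recursive definition above:

```python
import numpy as np

def fwht(x):
    """Unnormalised fast Walsh-Hadamard transform of a length-2^k vector.

    Each of the log2(n) passes does O(n) work, so the total cost is O(n log n).
    """
    x = np.asarray(x, dtype=float).copy()
    n = x.shape[0]
    h = 1
    while h < n:
        x = x.reshape(-1, 2 * h)
        a, b = x[:, :h].copy(), x[:, h:].copy()
        x[:, :h], x[:, h:] = a + b, a - b
        x = x.reshape(n)
        h *= 2
    return x

# Check against the explicit H_n built from the recursion H_{2n} = [[H, H], [H, -H]].
n = 8
H = np.array([[1.0]])
while H.shape[0] < n:
    H = np.block([[H, H], [H, -H]])

v = np.random.default_rng(0).standard_normal(n)
print(np.allclose(fwht(v), H @ v))                                     # True
print(np.allclose((H / np.sqrt(n)) @ (H / np.sqrt(n)).T, np.eye(n)))   # orthogonal
```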