Sources:
- HALO: Hadamard-Assisted Lower-Precision Optimization for LLMs
- As a reminder, rotation matrices are orthogonal matrices.
- For the math, please refer to Using orthogonal matrices for better quantization - The math
The method
- QuaRot and SpinQuant are very similar, so we’ll cover their constructions together.
- A summary of the rules derived in How to use orthogonal matrices in a classic residual NN (purely offline); a weight-folding sketch follows the diagram below:
- Use RMSNorm (pre or post-norm is fine)
- The embedding matrix should be offline right-multiplied
- The input featurizer weights should be offline left-multiplied (e.g. QKV projections, gate and up projections in SwiGLU)
- We can absorb the RMSNorm diagonal scaling parameters into the input featurizer pre-processing.
- The output featurizer weights should be offline right-multiplied (usually just a linear layer)
- The LM head matrix should be offline left-multiplied
- Such rules apply for any residual block consisting of a pre/post-norm, an input featurizer (e.g. the QKV projections), an operator/system (e.g. self-attention), and an output featurizer (e.g. the attention output projection or the SwiGLU down projection).
Diagram of the above rules
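To make these rules concrete, here is a minimal PyTorch sketch of the offline weight folding, assuming row-vector activations, nn.Linear weights stored as (out_features, in_features), and illustrative module names (embed, lm_head, blocks, q_proj, ...) that do not correspond to any real model class:

```python
import torch

@torch.no_grad()
def fold_rotation_offline(model, Q: torch.Tensor):
    """Fold an orthogonal Q (d x d) into the weights so that the residual stream
    becomes x @ Q everywhere, without changing the model's output.
    All attribute names below are illustrative placeholders."""
    # Embedding rows live in the residual stream: right-multiply by Q.
    model.embed.weight.copy_(model.embed.weight @ Q)
    # The LM head reads the rotated residual stream. nn.Linear computes x @ W.T,
    # so W @ Q here is the "left-multiply" of the math-convention (d_in, d_out) matrix by Q.T.
    model.lm_head.weight.copy_(model.lm_head.weight @ Q)

    for block in model.blocks:
        for norm, in_projs in ((block.attn_norm, (block.q_proj, block.k_proj, block.v_proj)),
                               (block.mlp_norm, (block.gate_proj, block.up_proj))):
            for lin in in_projs:
                # Absorb the RMSNorm scale gamma into the input featurizer: W <- W @ diag(gamma).
                lin.weight.copy_(lin.weight * norm.weight)
                # Input featurizers read the rotated residual stream: cancel Q.
                lin.weight.copy_(lin.weight @ Q)
            # The scale-free RMSNorm now commutes with Q (rms(x @ Q) = rms(x)).
            norm.weight.fill_(1.0)
        # Output featurizers write back into the residual stream: rotate their output.
        for lin in (block.o_proj, block.down_proj):
            lin.weight.copy_(Q.T @ lin.weight)
```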
- Additionally, we can optionally apply online rotations to reduce outliers within a block, usually before or after the application of the operator.
- For the KV cache, so that it can be stored quantized
- This is at the cost of one rotation during prefill, and two rotations during decode (one before quantizing the new KV entries, one when dequantizing the cached KV entries); see the sketch after this list
- After the gating in the SwiGLU
- This is at the cost of one rotation before quantizing the gate output
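As a rough illustration of that decode-time cost, here is a minimal sketch using a plain matmul with an orthogonal matrix and a naive per-token int4 quantizer; both are placeholders, since the actual methods use fast Hadamard kernels and fuse the inverse rotation into neighbouring ops where possible:

```python
import torch

def quantize_int4(x):
    # naive symmetric per-token 4-bit quantization (illustration only)
    scale = x.abs().amax(dim=-1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(x / scale), -8, 7)
    return q, scale

d_head = 128
H = torch.linalg.qr(torch.randn(d_head, d_head))[0]  # stand-in for a Hadamard rotation

# decode step: rotate the new key (or value) before quantizing it into the cache ...
k_new = torch.randn(1, d_head)
q_k, s_k = quantize_int4(k_new @ H)

# ... and undo the rotation when dequantizing cached entries for attention
k_restored = (q_k * s_k) @ H.T
print((k_restored - k_new).abs().max())  # small error; H spreads outlier coordinates before quantization
```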
How to choose the rotation matrices?
- As said in Hadamard matrices, using such structured matrices allows us to use the Walsh-Hadamard transform, which computes the matrix-vector product in $O(d \log d)$ operations.
- This is especially important if we use online rotation.
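A minimal sketch of the butterfly recursion behind that $O(d \log d)$ claim, checked against the explicit Sylvester construction of the Hadamard matrix (assumes $d$ is a power of two):

```python
import torch

def fwht(x: torch.Tensor) -> torch.Tensor:
    """Compute x @ H_d (unnormalized +/-1 Hadamard) along the last dim in O(d log d).
    Divide the result by sqrt(d) to make the transform orthogonal."""
    d = x.shape[-1]
    assert d & (d - 1) == 0, "last dimension must be a power of two"
    y, h = x.clone(), 1
    while h < d:
        y = y.view(*x.shape[:-1], d // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = torch.stack((a + b, a - b), dim=-2).reshape(*x.shape[:-1], d)
        h *= 2
    return y

# sanity check against the explicit Sylvester Hadamard matrix
d = 8
H = torch.tensor([[1.0]])
while H.shape[0] < d:
    H = torch.cat([torch.cat([H, H], dim=1), torch.cat([H, -H], dim=1)], dim=0)
x = torch.randn(3, d)
assert torch.allclose(fwht(x), x @ H, atol=1e-5)
```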
- However, any random orthogonal matrix $Q$ (obtained by taking $Q$ from the QR decomposition of a random real matrix $A$) is theoretically sufficient.
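For reference, a minimal sketch of sampling such a matrix; the sign correction makes $Q$ uniformly (Haar) distributed instead of biased by the QR sign convention:

```python
import torch

def random_orthogonal(d: int) -> torch.Tensor:
    A = torch.randn(d, d)
    Q, R = torch.linalg.qr(A)
    return Q * torch.sign(torch.diagonal(R))  # flip column signs to remove the QR bias

Q = random_orthogonal(512)
print(torch.dist(Q.T @ Q, torch.eye(512)))  # ~0, i.e. Q is orthogonal
```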
- However, according to SpinQuant, the performance variance across random rotation matrices is quite large.
Zero-shot accuracy of W4A4
- Thus, SpinQuant learns $R_1$ and $R_2$ on the Stiefel manifold, i.e. the set of all orthonormal matrices. They use Cayley SGD.
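The exact Cayley SGD update is not reproduced here; the sketch below only illustrates how one can stay on the manifold, by parameterizing the rotation through the Cayley transform of a skew-symmetric matrix (orthogonal by construction) and optimizing the free parameter on a made-up outlier-reduction objective:

```python
import torch

d = 64
M = torch.zeros(d, d, requires_grad=True)          # unconstrained parameter
opt = torch.optim.Adam([M], lr=1e-2)

def rotation(M):
    A = M - M.T                                     # skew-symmetric
    I = torch.eye(d)
    # Cayley transform: (I - A)^{-1} (I + A) is orthogonal whenever A is skew-symmetric
    return torch.linalg.solve(I - A, I + A)

# toy "activations" with outlier channels (stand-in for real calibration data)
X = torch.randn(1024, d) * torch.linspace(0.1, 5.0, d)

for _ in range(100):
    R = rotation(M)
    loss = (X @ R).abs().amax(dim=0).max()          # toy objective: shrink the worst channel magnitude
    opt.zero_grad(); loss.backward(); opt.step()

R = rotation(M.detach())
print(torch.dist(R.T @ R, torch.eye(d)))            # ~0: still a rotation after training
```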
How to quantize the rotated weights?
- Both methods use GPTQ after having rotated the weights.
- One reason why these methods work well with GPTQ is that the activations are rotated.
- Remember that a random rotation turns a fixed vector $x \in \mathbb{R}^d$ into a direction whose coordinates behave as if they were i.i.d. $\mathcal{N}(0, \|x\|_2^2 / d)$.
- Beyond the fact that this reduces outliers with high probability, it also decorrelates the activation channels, meaning that the covariance matrix $X^\top X$ is nearly diagonal. This makes the optimization problem better behaved :)
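A small numerical illustration of those last two points, on made-up activations with a few outlier channels and one strongly correlated pair of channels:

```python
import torch

torch.manual_seed(0)
n, d = 8192, 256
X = torch.randn(n, d)
X[:, :8] *= 3.0                               # a few outlier channels
X[:, 8] = X[:, 9] + 0.2 * torch.randn(n)      # a strongly correlated pair of channels

Q = torch.linalg.qr(torch.randn(d, d))[0]     # random rotation
Xr = X @ Q

def report(name, A):
    C = torch.corrcoef(A.T)                   # channel-channel correlation matrix
    off = (C - torch.eye(d)).abs().max()      # strongest off-diagonal correlation
    diag = (A * A).mean(dim=0)                # diagonal of X^T X / n
    print(f"{name}: max|activation|={A.abs().max().item():.1f}  "
          f"max off-diag |corr|={off.item():.2f}  diag max/min={(diag.max() / diag.min()).item():.1f}")

report("before rotation", X)
report("after rotation ", Xr)
# after the rotation, the per-coordinate outliers are flattened, the diagonal of X^T X
# is nearly uniform, and the strongest channel-channel correlation is much weaker
```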