The method

  • QuaRot and SpinQuant are very similar, so we’ll cover their constructions together.

  • A summary of the rules derived in "How to use orthogonal matrices in a classic residual NN" (purely offline):

    • use RMSNorm (pre- or post-norm is fine)
    • The embedding matrix should be offline right-multiplied
    • The input featurizer weights should be offline left-multiplied (e.g. QKV projections, gate and up projection in SwiGLU)
      • We can absorb the RMSNorm diagonal scaling parameters into the input featurizer weights as part of this offline pre-processing.
    • The output featurizer weights should be offline right-multiplied (usually just a linear layer)
    • The LM head matrix should be offline left-multiplied
  • Such rules apply to any residual block consisting of a pre/post-norm, a featurizer (e.g. the QKV projections, or the gate and up projections in SwiGLU), an operator/system (e.g. self-attention), and an output featurizer (e.g. the attention output projection, or the down projection in SwiGLU); a small PyTorch sketch of these rules follows the diagram below.

Diagram of the above rules
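
To make the rules concrete, here is a minimal PyTorch sketch on a toy pre-norm model (ReLU instead of SwiGLU, one residual block). The module layout and names are mine, not QuaRot/SpinQuant code; it only checks that folding the RMSNorm scales and fusing an orthogonal Q into the weights leaves the outputs unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, d_ff, vocab = 64, 256, 1000
Q, _ = torch.linalg.qr(torch.randn(d, d))   # any orthogonal matrix works for the offline rules

emb = nn.Embedding(vocab, d)
g_block = torch.rand(d) + 0.5               # RMSNorm scale inside the block
g_final = torch.rand(d) + 0.5               # RMSNorm scale before the LM head
up = nn.Linear(d, d_ff, bias=False)         # input featurizer (reads the residual stream)
down = nn.Linear(d_ff, d, bias=False)       # output featurizer (writes to the residual stream)
lm_head = nn.Linear(d, vocab, bias=False)

def rmsnorm(x, g):
    return g * x / torch.sqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6)

def forward(ids, g_block, g_final):
    h = emb(ids)                                        # residual stream
    h = h + down(torch.relu(up(rmsnorm(h, g_block))))   # one pre-norm residual block
    return lm_head(rmsnorm(h, g_final))

ids = torch.randint(0, vocab, (2, 5))
with torch.no_grad():
    ref = forward(ids, g_block, g_final)

    # 1) absorb the RMSNorm scales into the layers that read the normalized residual,
    #    then replace the scales by ones (RMSNorm without a scale commutes with Q)
    up.weight *= g_block
    lm_head.weight *= g_final

    # 2) fuse the rotation. nn.Linear stores W as (out, in) and computes x @ W.T, so the
    #    "left/right-multiply" rules above all become a single matmul on the stored weight:
    emb.weight.copy_(emb.weight @ Q)            # embedding: right-multiplied
    up.weight.copy_(up.weight @ Q)              # input featurizer ("left-multiplied" in math form)
    lm_head.weight.copy_(lm_head.weight @ Q)    # LM head ("left-multiplied" in math form)
    down.weight.copy_(Q.T @ down.weight)        # output featurizer ("right-multiplied" in math form)

    ones = torch.ones(d)
    out = forward(ids, ones, ones)

print(torch.allclose(ref, out, atol=1e-4))      # True: the rotated, fused model is exact
```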

  • Additionally, we can optionally apply online rotations to reduce outliers within a block, usually before or after the application of the operator (see the sketch after this list).
    • For the KV-cache
      • so that the KV cache can be stored in quantized form.
      • This is at the cost of one rotation during prefill, and two rotations during decode (one before quantizing the new KV values, one before dequantizing the current KV values)
    • After the gating in the SwiGLU
      • This is at the cost of one rotation before quantizing the gated activations (the input of the down projection)
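
For concreteness, here is what "rotate, then quantize the KV entries" can look like, using a fast Walsh-Hadamard transform as the online rotation (more on Hadamard matrices in the next section) and a toy symmetric quantizer. The fwht and quantize_sym helpers are my own sketch under those assumptions, not the papers' fused kernels or their exact placement of the rotations.

```python
import torch

def fwht(x: torch.Tensor) -> torch.Tensor:
    """Orthonormal Walsh-Hadamard transform along the last dim (size must be a power of 2).
    Costs O(d log d) instead of the O(d^2) of an explicit matmul; since the normalized
    Hadamard matrix is symmetric and orthogonal, fwht is its own inverse."""
    d = x.shape[-1]
    assert d & (d - 1) == 0, "last dimension must be a power of two"
    y, h = x.clone(), 1
    while h < d:
        y = y.reshape(*x.shape[:-1], d // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = torch.stack((a + b, a - b), dim=-2)   # butterfly: combine pairs at stride h
        h *= 2
    return y.reshape(x.shape) / d ** 0.5

def quantize_sym(x, n_bits=4):
    """Toy symmetric per-token quantizer (stand-in for the real KV-cache quantizer)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    return torch.round(x / scale).clamp(-qmax, qmax), scale

# Decode step: rotate the new value vector, quantize it into the cache ...
v_new = torch.randn(1, 1, 128)        # (batch, seq=1, head_dim), head_dim a power of two
v_q, s = quantize_sym(fwht(v_new))
# ... and rotate back after dequantizing when the cached value is read again.
v_read = fwht(v_q * s)
print((v_new - v_read).abs().max())   # round-trip error of the 4-bit cache; the rotation itself is exactly invertible
```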

How to choose the rotation matrices?

  • As said in "Hadamard matrices", using such structured matrices allows us to use the fast Walsh-Hadamard transform, which computes the matrix-vector product in $O(d \log d)$ operations instead of $O(d^2)$.

    • This is especially important if we use online rotations (as in the fwht sketch above).
  • However, any random orthogonal matrix (obtained by taking the orthogonal factor $Q$ from the QR-decomposition of any random real matrix) is theoretically sufficient.

  • However, according to SpinQuant, the variance in performance across random rotation matrices is quite large.

Zero-shot accuracy of W4A4

  • Thus, SpinQuant learns the rotation matrices ($R_1$ and $R_2$ in the paper's notation) on the Stiefel manifold, i.e. the set of all orthonormal matrices. They use Cayley SGD (a toy step is sketched below).
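
A minimal sketch of one such step, under a toy objective of mine (shrink per-token outliers of the rotated activations): a Cayley retraction keeps the iterate exactly orthogonal at every step. The cayley_step function and the objective are my own; the real Cayley SGD optimizer is fancier (momentum, an approximate Cayley transform), and SpinQuant trains the rotations against the quantized network's loss rather than this toy objective.

```python
import torch

def cayley_step(R, grad, lr):
    """One SGD step on the manifold of orthogonal matrices via the Cayley transform:
    A = G R^T - R G^T is skew-symmetric, so (I + lr/2 A)^{-1} (I - lr/2 A) is orthogonal
    and the updated R stays exactly on the manifold while moving along a descent direction."""
    I = torch.eye(R.shape[0], dtype=R.dtype)
    A = grad @ R.T - R @ grad.T
    return torch.linalg.solve(I + 0.5 * lr * A, (I - 0.5 * lr * A) @ R)

torch.manual_seed(0)
d = 64
X = torch.randn(1024, d) * torch.linspace(0.1, 8.0, d)  # activations with outlier channels
R, _ = torch.linalg.qr(torch.randn(d, d))                # random orthogonal init (the QR trick above)

for _ in range(100):
    R.requires_grad_(True)
    loss = (X @ R).abs().amax(dim=-1).mean()             # toy objective: per-token max magnitude
    (g,) = torch.autograd.grad(loss, R)
    with torch.no_grad():
        R = cayley_step(R.detach(), g, lr=0.02)

with torch.no_grad():
    print((X @ R).abs().amax(dim=-1).mean())             # toy objective after training
    print(torch.dist(R.T @ R, torch.eye(d)))             # ~0: R is still orthogonal
```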

How to quantize the rotated weights?

  • Both methods use GPTQ after having rotated the weights.
  • One reason why these methods work well with GPTQ is that the activations are rotated.
    • Remember that a random rotation turns a fixed vector $x$ into a uniformly random direction of the same norm, whose coordinates behave as if they were i.i.d. $\mathcal{N}(0, \|x\|^2/d)$.
    • Beyond the fact that this reduces outliers with high probability, it also decorrelates the columns of the activation matrix $X$, meaning that the covariance matrix $X^\top X$ is nearly diagonal. This makes GPTQ's layer-wise optimization problem better behaved :) (see the quick demo below)
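
A quick numerical illustration of the first point (a toy demo of mine, not from either paper): plant a few outliers in a vector, apply a random rotation, and the largest coordinate drops to a few multiples of $\|x\|/\sqrt{d}$, which is exactly what i.i.d. Gaussian coordinates would give.

```python
import torch

torch.manual_seed(0)
d = 2048
x = torch.randn(d)
x[::256] += 50.0                              # plant a handful of large outlier coordinates

Q, _ = torch.linalg.qr(torch.randn(d, d))     # random orthogonal matrix
y = x @ Q                                     # rotated vector, same norm as x

sigma = x.norm() / d ** 0.5                   # the "i.i.d. Gaussian" scale ||x|| / sqrt(d)
print(x.abs().max() / sigma)                  # large: dominated by the planted outliers
print(y.abs().max() / sigma)                  # ~4: consistent with the max of d Gaussians
```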