Sources:
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- GPTQv2: Efficient Finetuning-Free Quantization for Asymmetric Calibration
- QQQ: Quality Quattuor-Bit Quantization for Large Language Models

The goal of these methods is to optimally compress the weights such that the L2 distance between the original activations and the activations from the quantized model is minimal. The activations are obtained from a calibration dataset.
Layer-Wise Quantization
- For each layer with weights $W$ and layer input $X$ (a matrix of activations obtained using a calibration dataset), we perform the following optimization (a short sketch follows this list): $$\hat{W} = \arg\min_{\hat{W}} \; \lVert W X - \hat{W} X \rVert_2^2 .$$
- The quantization grid, i.e. the set of values we map to (e.g. int8), is fixed prior to the optimization.
- A weight scalar value can be mapped to an arbitrary point within the grid, i.e. it doesn't have to be "round-to-nearest".
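As a concrete illustration of the objective (names are mine, not from any of the papers), a minimal NumPy sketch of the proxy loss, the per-row Hessian $H = 2XX^\top$ used later, and the naive round-to-nearest baseline:

```python
import numpy as np

def layer_loss(W, W_hat, X):
    """Squared L2 distance between the original and quantized layer outputs."""
    return np.linalg.norm(W @ X - W_hat @ X) ** 2

def hessian(X):
    """Hessian of the objective with respect to any single weight row: H = 2 X X^T."""
    return 2.0 * X @ X.T

def round_to_nearest(W, scale):
    """Naive baseline: snap each weight onto a fixed symmetric int8 grid."""
    return np.clip(np.round(W / scale), -128, 127) * scale
```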
One way to solve it: whitening + projecting back
1. Decomposing each term by row
- Write $W$ as its rows $w_1, \dots, w_{d_{\text{out}}}$, and similarly $\hat{W}$. Then $\lVert WX - \hat{W}X \rVert_2^2 = \sum_i \lVert w_i X - \hat{w}_i X \rVert_2^2$, so the total squared error is just the sum of each row's error.
- Each row is independent of the others.
- If $XX^\top$ were the identity matrix, then each row error would just be the plain Euclidean distance $\lVert w_i - \hat{w}_i \rVert_2^2$, and you'd clearly pick each component of $\hat{w}_i$ to be the nearest allowed quantized value to the corresponding component of $w_i$.
- With a general $X$, each row error is $(w_i - \hat{w}_i)\,XX^\top\,(w_i - \hat{w}_i)^\top$. You can think of $XX^\top$ as a "weighting matrix" that tells you which directions in weight space matter more: if the activation energy along some direction is large, errors in that direction get penalized more heavily (the snippet after this list spells out this weighted row error).
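A two-line NumPy check of that weighted row error (illustrative names; `w` and `w_hat` are single rows):

```python
import numpy as np

def row_error(w, w_hat, X):
    """||w X - w_hat X||^2 equals the weighted form (w - w_hat) X X^T (w - w_hat)^T."""
    d = w - w_hat
    return d @ (X @ X.T) @ d  # same value as np.linalg.norm((w - w_hat) @ X) ** 2
```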
2. “Whitening”—turning the problem into plain Euclidean distance
We can remove that weighting by a simple change of variables:
- Compute the square root of the matrix $H = XX^\top$. Call it $H^{1/2}$ (so $H = H^{1/2} H^{1/2}$).
- Define a new row vector $w_i' = w_i H^{1/2}$ (and likewise $\hat{w}_i' = \hat{w}_i H^{1/2}$).
- Then $(w_i - \hat{w}_i)\,H\,(w_i - \hat{w}_i)^\top = \lVert w_i' - \hat{w}_i' \rVert_2^2$, which is just ordinary Euclidean distance in the "primed" space.
3. Nearest-neighbor quantization in the primed space
Now in this transformed space, the best quantized row $\hat{w}_i'$ simply picks each coordinate as the nearest codeword: $\hat{w}_{ij}' = \arg\min_{q \in \mathcal{Q}} \lvert w_{ij}' - q \rvert$. Here $\mathcal{Q}$ is your codebook (e.g. the set of 256 8-bit values).
4. Transform back to the original space
Finally, you undo the change of variables, $\hat{w}_i = \hat{w}_i' H^{-1/2}$, and snap the result back onto the grid with nearest-neighbour rounding, as in the sketch below.
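Putting the four steps together, a small NumPy sketch of this whitening heuristic (function and variable names are mine; `quantize_rtn` is an assumed nearest-grid-point helper):

```python
import numpy as np
from scipy.linalg import sqrtm

def quantize_rtn(W, grid):
    """Map each entry to the nearest value of the fixed quantization grid."""
    grid = np.asarray(grid)
    idx = np.abs(W[..., None] - grid).argmin(axis=-1)
    return grid[idx]

def whiten_quantize(W, X, grid):
    # e.g. grid = scale * np.arange(-8, 8) for a signed INT4 grid
    H = X @ X.T                                    # weighting matrix from calibration activations
    H_sqrt = np.real(sqrtm(H))                     # step 2: H^{1/2}
    W_prime = W @ H_sqrt                           # change of variables, row-wise w_i' = w_i H^{1/2}
    W_prime_hat = quantize_rtn(W_prime, grid)      # step 3: nearest neighbour in the primed space
    W_back = W_prime_hat @ np.linalg.inv(H_sqrt)   # step 4: undo the change of variables
    return quantize_rtn(W_back, grid)              # snap back onto the grid
```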
Optimal Brain Quantization
- The Optimal Brain Quantization (OBQ) method solves the layer-wise quantization problem defined above.
- OBQ handles each row independently, quantizing one weight at a time while always updating all not-yet-quantized weights, to compensate for the error incurred by quantizing a single weight.
- Since the corresponding objective is a quadratic whose Hessian is $H_F = 2 X_F X_F^\top$, where $F$ denotes the set of remaining full-precision weights, the greedy-optimal weight to quantize next, which we denote by $w_q$, and the corresponding optimal update of all weights in $F$, denoted by $\delta_F$, are given by the following formulas, where $\mathrm{quant}(w)$ rounds to the nearest value on the quantization grid: $$w_q = \arg\min_{w_q} \frac{(\mathrm{quant}(w_q) - w_q)^2}{[H_F^{-1}]_{qq}}, \qquad \delta_F = -\,\frac{w_q - \mathrm{quant}(w_q)}{[H_F^{-1}]_{qq}} \cdot (H_F^{-1})_{:,q} .$$ OBQ quantizes weights iteratively using these two equations, until all the weights of $W$ are quantized.
- OBQ quantizes weights in greedy order, i.e. it always picks the weight which currently incurs the least additional quantization error (sketched in code below).
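A deliberately naive NumPy sketch of this per-row loop (it re-inverts $H_F$ at every step instead of using the paper's rank-one downdate, and the helper names are mine):

```python
import numpy as np

def quant(x, grid):
    """Round value(s) to the nearest point of the fixed quantization grid."""
    grid = np.asarray(grid)
    return grid[np.abs(np.asarray(x)[..., None] - grid).argmin(axis=-1)]

def obq_quantize_row(w, H, grid):
    """Quantize one weight row w given the Hessian H = 2 X X^T."""
    w = w.astype(np.float64).copy()
    remaining = list(range(len(w)))
    quantized = np.zeros_like(w)
    while remaining:
        H_inv = np.linalg.inv(H[np.ix_(remaining, remaining)])
        # Greedy choice: the weight whose quantization adds the least error.
        errs = (quant(w[remaining], grid) - w[remaining]) ** 2 / np.diag(H_inv)
        k = int(np.argmin(errs))          # position within `remaining`
        q_idx = remaining[k]              # position within the full row
        q_val = quant(w[q_idx], grid)
        # Optimal update of all not-yet-quantized weights to compensate the error.
        delta = -(w[q_idx] - q_val) / H_inv[k, k] * H_inv[:, k]
        w[remaining] += delta             # also sets w[q_idx] to q_val
        quantized[q_idx] = q_val
        remaining.pop(k)
    return quantized
```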
GPTQ v1
GPTQ v1 is a collection of (quite involved) tricks that make OBQ fast and numerically stable at LLM scale: all rows quantize their weights in the same fixed column order instead of OBQ's greedy order, the error-feedback updates are applied lazily in blocks, and the needed inverse-Hessian information is read off a Cholesky factorization of $H^{-1}$ computed once, with diagonal dampening for stability.
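For intuition only, a condensed NumPy sketch of that column-wise loop (no lazy batching, my own variable names, and a `quant` helper passed in; not the reference implementation):

```python
import numpy as np

def gptq_quantize(W, X, quant, damp=0.01):
    """W: (d_out, d_in) weights, X: (d_in, n) calibration activations,
    quant: element-wise round-to-nearest onto the chosen grid."""
    d_in = W.shape[1]
    W = W.astype(np.float64).copy()
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)   # dampening for numerical stability
    Hinv = np.linalg.inv(H)
    U = np.linalg.cholesky(Hinv).T                   # upper factor: Hinv = U.T @ U
    for j in range(d_in):                            # same column order for every row
        w = W[:, j].copy()
        q = quant(w)
        err = (w - q) / U[j, j]
        W[:, j:] -= np.outer(err, U[j, j:])          # compensate on not-yet-quantized columns
        W[:, j] = q
    return W
```

The rows of the Cholesky factor play the role of the $(H_F^{-1})_{:,q}$ terms in the OBQ update, which is what lets every column be processed with a single precomputed factorization.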
GPTQ v2: Asymmetric Calibration and Parallelization
- GPTQ treats every layer independently.
- GPTQ v2 takes into account how the already-quantized layers progressively transform the activations; the accumulated mismatch becomes particularly acute as quantization proceeds through deeper layers.
- The layer-wise optimization objective of GPTQ can be written as $\arg\min_{\hat{W}} \lVert W X - \hat{W} X \rVert_2^2$, where $X$ are the full-precision activations.
- The GPTQv2 objective instead feeds the quantized layer the modified activations $\tilde{X}$ (changed both by activation quantization and by weight quantization in previous layers): $\arg\min_{\hat{W}} \lVert W X - \hat{W} \tilde{X} \rVert_2^2$ (see the sketch below).
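To make the distinction concrete, a tiny NumPy sketch of the two losses (variable names are mine; this only illustrates the objectives, not the GPTQv2 solver):

```python
import numpy as np

def symmetric_layer_loss(W, W_hat, X):
    """GPTQ v1 calibration: both terms see the full-precision input X."""
    return np.linalg.norm(W @ X - W_hat @ X) ** 2

def asymmetric_layer_loss(W, W_hat, X, X_tilde):
    """GPTQ v2 calibration: the quantized layer sees X_tilde, the input produced
    by the already-quantized (weights, and possibly activations) prefix of the model."""
    return np.linalg.norm(W @ X - W_hat @ X_tilde) ** 2
```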
Practical details
- All methods are weight-only and asymmetric per-channel by default.
- This works out of the box with signed-INT4 kernels if you clamp `min = -max` (symmetric) or if your kernel supports per-channel zero-points.
- Popular inference projects (ExLlama, Marlin) therefore tend to post-process GPTQ files into symmetric form.
How to post-process from asymmetric to symmetric
- For each channel (or group) with bit-width $b$ (see the sketch at the end of this section):
    - Dequantize asymmetrically: $w = s_a\,(q_a - z)$, with asymmetric scale $s_a$ and zero-point $z$.
    - Find the symmetric range: $m = \max_j |w_j|$.
    - Compute the new symmetric scale: $s_s = m / (2^{\,b-1} - 1)$.
    - Re-quantize: $q_s = \mathrm{clamp}\big(\mathrm{round}(w / s_s),\, -(2^{\,b-1}-1),\, 2^{\,b-1}-1\big)$.
- This is a loss-less rewrite unless some values saturate at the ends of the new symmetric range.
- In practice, the hit on perplexity is for Llama 2 7B
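As a rough illustration of the recipe above (the names and the exact clamping convention are assumptions on my part, not taken from a specific tool):

```python
import numpy as np

def asym_to_sym(q_asym, scale_a, zero_point, bits=4):
    """Rewrite per-channel asymmetric integer weights into symmetric form."""
    w = scale_a * (q_asym.astype(np.float64) - zero_point)  # 1. dequantize asymmetrically
    qmax = 2 ** (bits - 1) - 1                               # e.g. 7 for signed INT4
    m = np.max(np.abs(w), axis=-1, keepdims=True)            # 2. symmetric range per channel
    scale_s = np.maximum(m, 1e-12) / qmax                    # 3. new symmetric scale
    q_sym = np.clip(np.round(w / scale_s), -qmax, qmax)      # 4. re-quantize and clamp
    return q_sym.astype(np.int8), scale_s
```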