Layer-Wise Quantization

  • For each layer with weights $W$ and layer input $X$ (a set of activations obtained by running a calibration dataset through the model), we perform the following optimization (see the toy sketch after this list): $\arg\min_{\hat W}\ \lVert W X - \hat W X \rVert_2^2$.

  • The quantization grid, i.e. the set of values we map to (e.g. the int8 levels), is fixed prior to the optimization.
  • A scalar weight can be mapped to an arbitrary point on the grid, i.e. it does not have to be “round-to-nearest”.
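
As a concrete reference point, here is a minimal NumPy sketch of the objective above, using plain round-to-nearest as the baseline $\hat W$; the shapes and the `quantize_rtn` helper are illustrative choices for this toy example, not part of the original method.

```python
import numpy as np

# Toy setup for the layer-wise objective ||W X - What X||_F^2.
# W: layer weights (d_out, d_in); X: calibration activations (d_in, n_samples).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
X = rng.standard_normal((128, 256))

def quantize_rtn(w, n_bits=8):
    """Plain round-to-nearest onto a symmetric uniform grid (baseline only)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale  # dequantized weights

W_hat = quantize_rtn(W)
layer_error = np.linalg.norm(W @ X - W_hat @ X) ** 2
print(f"layer-wise output error: {layer_error:.4f}")
```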

One way to solve it: whitening + projecting back

1. Decomposing each term by row

  • Write $W$ as its rows $w_1, \dots, w_{d_\text{out}}$, and similarly $\hat W$ as $\hat w_1, \dots, \hat w_{d_\text{out}}$. Then $\lVert W X - \hat W X \rVert_2^2 = \sum_i \lVert w_i X - \hat w_i X \rVert_2^2$, so the total squared error is just the sum of each row’s error.

  • Each row is independent of the others.

  • If $XX^\top$ were the identity matrix, then the row error $\lVert w_i X - \hat w_i X \rVert_2^2 = (w_i - \hat w_i)\,XX^\top\,(w_i - \hat w_i)^\top$ would just be the plain Euclidean distance $\lVert w_i - \hat w_i \rVert_2^2$, and you’d clearly pick each component $\hat w_{ij}$ to be the nearest allowed quantized value to $w_{ij}$.

  • With a general $H = XX^\top$, each row error is $(w_i - \hat w_i)\,H\,(w_i - \hat w_i)^\top$. You can think of $H$ as a “weighting matrix” that tells you which directions in weight space matter more: if $H$ is large along some direction, errors in that direction get penalized more heavily.
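
Continuing the toy snippet above, a quick check that the total error really decomposes into per-row quadratic forms weighted by $H = XX^\top$ (this reuses `W`, `X`, and `W_hat` from the earlier sketch):

```python
# Continuing the snippet above: the Frobenius error decomposes row by row,
# with H = X X^T acting as the per-row "weighting matrix".
H = X @ X.T
E = W - W_hat                                    # per-row error e_i = w_i - what_i
row_errors = np.einsum("ij,jk,ik->i", E, H, E)   # e_i H e_i^T for every row i
total = np.linalg.norm(W @ X - W_hat @ X) ** 2
assert np.isclose(row_errors.sum(), total)
```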

2. “Whitening”—turning the problem into plain Euclidean distance

We can remove that weighting by a simple change of variables:

  • Compute the square root of the matrix $H = XX^\top$; call it $H^{1/2}$.
  • Define a new row vector $w_i' = w_i H^{1/2}$ (and similarly $\hat w_i' = \hat w_i H^{1/2}$).
  • Then $(w_i - \hat w_i)\,H\,(w_i - \hat w_i)^\top = \lVert w_i' - \hat w_i' \rVert_2^2$, which is just ordinary Euclidean distance in the “primed” space.
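
A small sketch of the whitening step, continuing the earlier snippets: $H^{1/2}$ is computed via an eigendecomposition, which is valid since $H = XX^\top$ is symmetric positive semi-definite (the variable names are illustrative).

```python
# Continuing: matrix square root of H (symmetric PSD) via eigendecomposition,
# then whiten every weight row: w_i' = w_i H^{1/2}.
eigval, eigvec = np.linalg.eigh(H)
H_sqrt = eigvec @ np.diag(np.sqrt(np.clip(eigval, 0.0, None))) @ eigvec.T

W_prime = W @ H_sqrt            # rows now live in the "primed" space
```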

3. Nearest-neighbor quantization in the primed space

Now, in this transformed space, the best quantized row $\hat w_i'$ simply picks each coordinate as the nearest codeword: $\hat w'_{ij} = \arg\min_{q \in \mathcal{Q}} \lvert w'_{ij} - q \rvert$. Here $\mathcal{Q}$ is your codebook (e.g. the set of 256 8-bit values).
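
A sketch of this per-coordinate nearest-neighbour step, continuing the previous snippet; the uniform int8-style codebook is just an example, any finite set $\mathcal{Q}$ would do.

```python
# Continuing: per-coordinate nearest-neighbour quantization in the primed space,
# against a uniform 8-bit codebook Q.
qmax = 2 ** 7 - 1
scale_p = np.abs(W_prime).max() / qmax
Q = np.arange(-qmax, qmax + 1) * scale_p                 # 255 codewords
idx = np.abs(W_prime[..., None] - Q).argmin(axis=-1)     # nearest codeword per entry
W_prime_hat = Q[idx]
```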

4. Transform back to the original space

Finally, you undo the change of variables, $\hat w_i = \hat w_i'\,H^{-1/2}$, and quantize the result to its nearest neighbour on the grid once more.
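
The last step, continuing the earlier snippets: undo the whitening with $H^{-1/2}$ (a pseudo-inverse is used only as a numerical safeguard) and snap back onto the weight grid, then compare against the plain round-to-nearest baseline from the first sketch.

```python
# Continuing: undo the change of variables with H^{-1/2}, then snap the result
# back onto the actual weight grid with one more round-to-nearest pass.
H_inv_sqrt = np.linalg.pinv(H_sqrt)
W_back = W_prime_hat @ H_inv_sqrt
W_hat_whitened = quantize_rtn(W_back)        # final nearest-neighbour on the grid

err_whitened = np.linalg.norm(W @ X - W_hat_whitened @ X) ** 2
print(f"whitening pipeline error: {err_whitened:.4f} (RTN baseline: {layer_error:.4f})")
```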

Optimal Brain Quantization

  • The Optimal Brain Quantization (OBQ) method solves the layer-wise quantization problem defined above.

  • OBQ handles each row of $W$ independently, quantizing one weight at a time while always updating all not-yet-quantized weights in order to compensate for the error incurred by quantizing a single weight.

  • Since the corresponding objective is a quadratic whose Hessian is $H_F = 2 X_F X_F^\top$, where $F$ denotes the set of remaining full-precision weights, the greedy-optimal weight to quantize next, which we denote by $w_q$, and the corresponding optimal update of all weights in $F$, denoted by $\delta_F$, are given by the following formulas, where $\mathrm{quant}(w)$ rounds $w$ to the nearest value on the quantization grid:

    $$w_q = \arg\min_{w_q} \frac{(\mathrm{quant}(w_q) - w_q)^2}{[H_F^{-1}]_{qq}}, \qquad \delta_F = -\,\frac{w_q - \mathrm{quant}(w_q)}{[H_F^{-1}]_{qq}} \cdot (H_F^{-1})_{:,q}$$

    OBQ quantizes weights iteratively using these two equations until all the weights of $w$ are quantized (see the sketch after this list).

  • OBQ quantizes weights in greedy order, i.e. it always picks the weight which currently incurs the least additional quantization error.
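
The sketch below spells out the OBQ inner loop for a single row, directly following the two formulas above. The `quant` helper (a symmetric uniform int4-style grid), the diagonal damping term, and the rank-one removal of the quantized index from $H_F^{-1}$ are implementation choices of this sketch, not a verbatim reproduction of the paper's code.

```python
import numpy as np

def quant(x, scale, qmax=7):
    """Round onto a symmetric uniform grid (int4-style levels, illustrative)."""
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def obq_quantize_row(w, X, scale, qmax=7, damp=1e-2):
    """Greedy OBQ for one weight row w (shape (d,)) with calibration input X (d, n).

    Repeats: pick the weight with the smallest (quant(w_q) - w_q)^2 / [H_F^-1]_qq,
    freeze it, and spread delta_F = -(w_q - quant(w_q)) / [H_F^-1]_qq * (H_F^-1)_{:,q}
    over the remaining full-precision weights.
    """
    w = w.astype(np.float64).copy()
    d = w.shape[0]
    H = 2.0 * X @ X.T
    # Small diagonal damping for invertibility (an implementation choice here).
    H_inv = np.linalg.inv(H + damp * np.mean(np.diag(H)) * np.eye(d))
    q_out = np.zeros(d)
    remaining = list(range(d))
    while remaining:
        diag = np.array([H_inv[i, i] for i in remaining])
        errs = (quant(w[remaining], scale, qmax) - w[remaining]) ** 2 / diag
        j = remaining[int(np.argmin(errs))]        # greedy-optimal weight to quantize
        q_out[j] = quant(w[j], scale, qmax)
        w += -(w[j] - q_out[j]) / H_inv[j, j] * H_inv[:, j]   # compensate the rest
        remaining.remove(j)
        # Drop index j from H_F^-1 (Gaussian-elimination / rank-one update).
        H_inv -= np.outer(H_inv[:, j], H_inv[j, :]) / H_inv[j, j]
    return q_out
```

On the toy layer from the earlier snippets, `obq_quantize_row(W[0], X, scale=np.abs(W[0]).max() / 7)` would quantize the first row; the loop is quadratic-to-cubic in the row width, which is why it needs further tricks at LLM scale.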

GPTQ v1

GPTQ v1 is a collection of (quite involved) tricks that make OBQ fast and numerically stable for large models.

GPTQ v2: Asymmetric Calibration and Parallelization

  • GPTQ treats every layer independently.

  • GPTQ v2 takes into account how already-quantized layers progressively transform the activation patterns; the mismatch becomes particularly acute as quantization proceeds through deeper layers.

  • The layer-wise optimization objective of GPTQ (v1) can be written as $\arg\min_{\hat W}\ \lVert W X - \hat W X \rVert_2^2$, where the input $X$ comes from the full-precision model.

  • The GPTQ v2 objective instead uses the modified activations $\tilde X$ (perturbed both by activation quantization and by weight quantization in previous layers): $\arg\min_{\hat W}\ \lVert W X - \hat W \tilde X \rVert_2^2$ (see the sketch below).
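
The difference is easiest to see side by side. The sketch below only illustrates the two objectives (it is not the GPTQ v2 solver), with `X_tilde` standing for the activations produced by the already-quantized part of the network.

```python
import numpy as np

def gptq_v1_objective(W, W_hat, X):
    # Layer error measured against full-precision inputs X.
    return np.linalg.norm(W @ X - W_hat @ X) ** 2

def gptq_v2_objective(W, W_hat, X, X_tilde):
    # Asymmetric calibration: the quantized layer is judged on the inputs
    # X_tilde it will actually receive from the (partially) quantized network,
    # while the target W @ X still comes from the full-precision model.
    return np.linalg.norm(W @ X - W_hat @ X_tilde) ** 2
```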

Practical details

  • All methods are weight-only and asymmetric per-channel by default.

  • The resulting models work out of the box with signed-INT4 kernels if you clamp the range to min = -max (i.e. make it symmetric) or if your kernel supports per-channel zero-points.

    • Popular inference projects (ExLlama, Marlin) therefore tend to post-process GPTQ files into symmetric form.

How to post-process from asymmetric to symmetric

  • For each channel (or group) quantized with bit-width $b$ (see the sketch after this list):

    • dequantize asymmetrically: $w = s_a\,(q_a - z)$, where $q_a$ are the stored integer codes, $s_a$ the asymmetric scale, and $z$ the zero-point
    • find the symmetric range: $m = \max_i \lvert w_i \rvert$
    • new symmetric scale: $s_s = m / (2^{b-1} - 1)$
    • Re-quantize: $q_s = \mathrm{clamp}\big(\mathrm{round}(w / s_s),\, -(2^{b-1}-1),\, 2^{b-1}-1\big)$
  • This is a loss-less rewrite unless some values saturate at $\pm(2^{b-1}-1)$

    • In practice, the hit on perplexity is for Llama 2 7B
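
A sketch of the conversion for one channel or group, following the four steps above; the function name and argument names are illustrative, not any particular library's API.

```python
import numpy as np

def asym_to_sym(q_a, scale_a, zero_point, n_bits=4):
    """Rewrite one channel/group from asymmetric to symmetric quantization.

    q_a: stored unsigned integer codes; scale_a, zero_point: their parameters.
    Names and signature are assumptions of this sketch.
    """
    qmax = 2 ** (n_bits - 1) - 1
    w = scale_a * (q_a.astype(np.float64) - zero_point)  # 1. dequantize asymmetrically
    m = np.abs(w).max()                                   # 2. symmetric range max_i |w_i|
    scale_s = m / qmax                                    # 3. new symmetric scale
    q_s = np.clip(np.round(w / scale_s), -qmax, qmax)     # 4. re-quantize (may saturate)
    return q_s.astype(np.int8), scale_s
```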