-
Have a look at the make_qx_quants function in ggml-quants.c (knowing that
rmse_type
is always1
) which is used to find the scale ofQ3_K
andQ6_K
(i.e. the k-quants which donβt use a min, a bit likeQ8_0
). -
For the k-quants which do have a min, (
Q2_K
,Q4_K
, andQ5_K
), thereβs the make_qkx2_quants function which seems to do something similar but with a min too.
Classic abs-max 4 bit quantization
- Example:
Q4_0
, type-0 4-bit quantization with blocks having 32 weights.
General implementation of llama.cpp
-
The function first finds the maximum absolute value in the input array
x
. -
If all values are close to zero (less than GROUP_MAX_EPS), it sets all quantized values to 0 and returns.
-
It calculates an initial scale factor
iscale
based on the maximum value andnmax
. (like abs-max quantization)- If
rmse_type
is 0, it stops here, performs a simple quantization using the initial scale and returns.
- If
-
For other
rmse_type
values, it performs a more complex quantization process:-
It calculates the quantization error as weighted sums of the input values and their quantized representations.
- Looping over
for (int i = 0; i < n; ++i)
l
is the quantized representation ofx[i]
w
is the importance factor (defined either byrmse_type
or provided byqw[i]
)rmse_type == 1
: Weights error by square of input value (the default) i.e.x[i]*x[i]
rmse_type == 2
: Uniform weightingrmse_type == 3
: Weights error by absolute value of inputrmse_type == 4
: Weights error by square root of absolute value of input
sumlx += w*x[i]*l
suml2 += w*l*l
- Looping over
-
It computes an optimal scale factor that minimizes the quantization error.
- This is
float scale = suml2 ? sumlx/suml2 : 0
- Proof:
- . Quantization Model: Let be the original values, be the integer quantized values, and be the scale factor. The quantized representation of is .
- Weighted Error: Let be the weight for each value. The weighted squared error is:
- Minimizing Error: To find the optimal scale, we minimize with respect to :
- Solving for Optimal Scale: From the above equation:
- Correspondence to Code: In the code: is calculated as
sumlx
. is calculated assuml2
- This is
-
-
However, due to the discrete nature of quantization, this analytically derived optimal scale might not always give the absolute best results in practice. It thus then goes on to do scale refinement
- The function then tries to improve this scale by testing 18 nearby values (9 above and 9 below the initial scale).
- For each tested scale:
- It requantizes the input values.
- Recalculates
sumlx
andsuml2
. - Checks if this new scale produces a better result (higher
sumlx*sumlx / suml2
). - If a better scale is found, it updates the quantized values in
L
and thescale
variable.
- For each tested scale:
- The function then tries to improve this scale by testing 18 nearby values (9 above and 9 below the initial scale).
-
The quantized values are stored in the
L
array, offset bynmax
to make them non-negative. -
The function returns the final scale factor, which can be used to dequantize the values later.