- Have a look at the `make_qx_quants` function in `ggml-quants.c` (knowing that `rmse_type` is always 1), which is used to find the scale of `Q3_K` and `Q6_K` (i.e. the k-quants which don't use a min, a bit like `Q8_0`).
- For the k-quants which do have a min (`Q2_K`, `Q4_K`, and `Q5_K`), there's the `make_qkx2_quants` function, which seems to do something similar but with a min too.
Classic abs-max 4-bit quantization
- Example: `Q4_0`, type-0 4-bit quantization with blocks of 32 weights.
// reference implementation for deterministic creation of model files
void quantize_row_q4_0_reference(const float * restrict x, block_q4_0 * restrict y, int64_t k) {
    static const int qk = QK4_0; // defined as 32
    assert(k % qk == 0);

    const int nb = k / qk; // number of blocks within the row

    // iterate over each block
    for (int i = 0; i < nb; i++) {
        float amax = 0.0f; // absolute max
        float max  = 0.0f; // actual max-magnitude value (can be negative)

        // find the max-magnitude value within the block
        for (int j = 0; j < qk; j++) {
            const float v = x[i*qk + j];
            if (amax < fabsf(v)) {
                amax = fabsf(v);
                max  = v;
            }
        }

        // divide by -8 because the largest-magnitude signed 4-bit integer is in the negative range, i.e. -8;
        // this means the max-magnitude value is mapped to 0 in unsigned 4-bit and values at the other end of the range land close to 15
        const float d = max / -8;

        // inverse of d for faster multiplication later; if d is zero, all values in the block are zero and no scaling is needed
        const float id = d ? 1.0f/d : 0.0f;

        // store the scale factor as float16
        y[i].d = GGML_FP32_TO_FP16(d);

        // quantize the values in the block
        for (int j = 0; j < qk/2; ++j) {
            // process two values at a time because int4_t doesn't exist
            const float x0 = x[i*qk + 0    + j]*id;
            const float x1 = x[i*qk + qk/2 + j]*id;

            // quantize to UNSIGNED 4-bit integers:
            // add 8.5f to move from signed to unsigned (and round), then clamp to at most 15
            const uint8_t xi0 = MIN(15, (int8_t)(x0 + 8.5f));
            const uint8_t xi1 = MIN(15, (int8_t)(x1 + 8.5f));

            // pack the two 4-bit values into one byte: low nibble is the first half of the block, high nibble the second half
            y[i].qs[j]  = xi0;
            y[i].qs[j] |= xi1 << 4;
        }
    }
}
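To see how the stored scale is used on the way back, here is a minimal dequantization sketch for a single `Q4_0` block. It mirrors the logic of llama.cpp's `dequantize_row_q4_0`, assuming the usual `block_q4_0` layout (`d` plus the packed nibbles in `qs`) and the `GGML_FP16_TO_FP32` macro, but it is a simplified illustration rather than the exact library code.

```c
// minimal sketch: dequantize one Q4_0 block (32 weights) back to floats
static void dequantize_block_q4_0(const block_q4_0 * restrict b, float * restrict out) {
    const float d = GGML_FP16_TO_FP32(b->d); // per-block scale back to fp32

    for (int j = 0; j < QK4_0/2; ++j) {
        // unpack the two 4-bit values stored in one byte and undo the +8 offset
        const int x0 = (b->qs[j] & 0x0F) - 8; // low nibble  -> first half of the block
        const int x1 = (b->qs[j] >>    4) - 8; // high nibble -> second half of the block

        // undo the scaling: x ≈ d * q
        out[j]           = d * x0;
        out[QK4_0/2 + j] = d * x1;
    }
}
```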
General implementation in llama.cpp
// implementation for type-0 k-quants
static float make_qx_quants(
    int n,                      // num weights within one block
    int nmax,                   // max representable value for the bit format we're quantizing to
    const float * restrict x,   // input array
    int8_t * restrict L,        // output array of quantized int8_t values
    int rmse_type,              // defines the weighted error metric
    const float * restrict qw   // optional array that can be used to assign different importance to each element during the quantization process
)
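For context, this is roughly how the function gets invoked: when quantizing `Q6_K`, each 16-weight sub-block of a super-block gets its own scale, with `nmax = 32` (the largest magnitude representable in signed 6-bit) and `rmse_type = 1`. The snippet below is an approximation of the call site in the `Q6_K` quantization routine, not verbatim library code.

```c
// approximate call per 16-weight sub-block of a Q6_K super-block
scales[ib] = make_qx_quants(16, 32, x + 16*ib, L + 16*ib, 1, NULL);
```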
- The function first finds the maximum absolute value in the input array `x`.
- If all values are close to zero (less than `GROUP_MAX_EPS`), it sets all quantized values to 0 and returns.
- It calculates an initial scale factor `iscale` based on the maximum value and `nmax` (like abs-max quantization).
    - If `rmse_type` is 0, it stops here, performs a simple quantization using the initial scale, and returns.
- For other `rmse_type` values, it performs a more complex quantization process (see the sketch after this list):
    - It calculates the quantization error as weighted sums of the input values and their quantized representations.
        - Looping over `for (int i = 0; i < n; ++i)`:
            - `l` is the quantized representation of `x[i]`
            - `w` is the importance factor (defined either by `rmse_type` or provided by `qw[i]`):
                - `rmse_type == 1`: weights the error by the square of the input value (the default), i.e. `x[i]*x[i]`
                - `rmse_type == 2`: uniform weighting
                - `rmse_type == 3`: weights the error by the absolute value of the input
                - `rmse_type == 4`: weights the error by the square root of the absolute value of the input
            - `sumlx += w*x[i]*l`
            - `suml2 += w*l*l`
    - It computes an optimal scale factor that minimizes the quantization error.
        - This is `float scale = suml2 ? sumlx/suml2 : 0`
        - Proof:
            - Quantization model: let $x_i$ be the original values, $l_i$ the integer quantized values, and $d$ the scale factor. The quantized representation of $x_i$ is $d\,l_i$.
            - Weighted error: let $w_i$ be the weight for each value. The weighted squared error is $E = \sum_i w_i (x_i - d\,l_i)^2$.
            - Minimizing error: to find the optimal scale, we minimize $E$ with respect to $d$: $\frac{\partial E}{\partial d} = -2 \sum_i w_i l_i (x_i - d\,l_i) = 0$.
            - Solving for the optimal scale: $d = \dfrac{\sum_i w_i x_i l_i}{\sum_i w_i l_i^2}$.
            - Correspondence to code: $\sum_i w_i x_i l_i$ is calculated as `sumlx`, and $\sum_i w_i l_i^2$ as `suml2`.
    - However, due to the discrete nature of quantization, this analytically derived optimal scale might not always give the best result in practice, so it then goes on to do scale refinement.
        - The function tries to improve the scale by testing 18 nearby values (9 above and 9 below the initial scale).
            - For each tested scale:
                - It requantizes the input values.
                - It recalculates `sumlx` and `suml2`.
                - It checks whether this new scale produces a better result (higher `sumlx*sumlx / suml2`).
                - If a better scale is found, it updates the quantized values in `L` and the `scale` variable.
- The quantized values are stored in the `L` array, offset by `nmax` to make them non-negative.
- The function returns the final scale factor, which can be used to dequantize the values later.
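To tie the walkthrough together, here is a condensed sketch of the whole flow, assuming `rmse_type == 1` and `qw == NULL` (the common case) and omitting the other weightings and early-return paths of the real function. The name `make_qx_quants_sketch`, the `1e-30f` zero threshold, and the use of `nearbyintf` for rounding are illustrative choices, not taken from `ggml-quants.c`.

```c
#include <math.h>
#include <stdint.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

// condensed sketch of the make_qx_quants flow for rmse_type == 1, qw == NULL
static float make_qx_quants_sketch(int n, int nmax, const float * x, int8_t * L) {
    // 1. find the value with the largest magnitude in the block
    float amax = 0.0f, max = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (fabsf(x[i]) > amax) { amax = fabsf(x[i]); max = x[i]; }
    }
    if (amax < 1e-30f) {                        // everything ~0: nothing to quantize
        for (int i = 0; i < n; ++i) L[i] = 0;
        return 0.0f;
    }

    // 2. initial abs-max inverse scale: maps the extreme value to -nmax
    const float iscale = -nmax / max;

    // 3. quantize with the initial scale and accumulate the weighted sums
    float sumlx = 0.0f, suml2 = 0.0f;
    for (int i = 0; i < n; ++i) {
        int l = (int)nearbyintf(iscale * x[i]);
        l = MAX(-nmax, MIN(nmax - 1, l));       // clamp to the signed range
        L[i] = (int8_t)l;
        const float w = x[i] * x[i];            // rmse_type == 1: weight by x^2
        sumlx += w * x[i] * l;
        suml2 += w * l * l;
    }

    // 4. closed-form scale minimizing sum_i w_i*(x_i - d*l_i)^2 for these l_i
    float scale = suml2 > 0 ? sumlx / suml2 : 0.0f;
    float best  = scale * sumlx;                // equals sumlx^2 / suml2

    // 5. refinement: test 18 nearby inverse scales (9 below, 9 above) and keep
    //    whichever integer grid gives a larger sumlx^2/suml2 (i.e. lower error)
    for (int is = -9; is <= 9; ++is) {
        if (is == 0) continue;
        const float itry = -(nmax + 0.1f * is) / max;
        float slx = 0.0f, sl2 = 0.0f;
        for (int i = 0; i < n; ++i) {
            int l = (int)nearbyintf(itry * x[i]);
            l = MAX(-nmax, MIN(nmax - 1, l));
            const float w = x[i] * x[i];
            slx += w * x[i] * l;
            sl2 += w * l * l;
        }
        if (sl2 > 0 && slx * slx > best * sl2) {  // better candidate: requantize with it
            for (int i = 0; i < n; ++i) {
                int l = (int)nearbyintf(itry * x[i]);
                L[i] = (int8_t)MAX(-nmax, MIN(nmax - 1, l));
            }
            scale = slx / sl2;
            best  = scale * slx;
        }
    }

    // 6. store the quantized values offset by nmax so they are non-negative
    for (int i = 0; i < n; ++i) L[i] += nmax;

    return scale;  // dequantize later as x ≈ scale * (L[i] - nmax)
}
```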