• Have a look at the make_qx_quants function in ggml-quants.c (knowing that rmse_type is always 1), which is used to find the scale of Q3_K and Q6_K (i.e. the k-quants that don’t use a min, a bit like Q8_0).

  • For the k-quants that do have a min (Q2_K, Q4_K, and Q5_K), there’s the make_qkx2_quants function, which does something similar but with a min too.

Classic abs-max 4-bit quantization

  • Example: Q4_0, a type-0 4-bit quantization scheme with blocks of 32 weights.
// reference implementation for deterministic creation of model files
void quantize_row_q4_0_reference(const float * restrict x, block_q4_0 * restrict y, int64_t k) {
    static const int qk = QK4_0; // defined as 32
 
    assert(k % qk == 0);
 
    const int nb = k / qk; // num_blocks within row
 
    // iterate over each block
    for (int i = 0; i < nb; i++) {
        float amax = 0.0f; // absolute max
        float max  = 0.0f; // actual max value (can be negative)
        
        // find max within block
        for (int j = 0; j < qk; j++) {
            const float v = x[i*qk + j];
            if (amax < fabsf(v)) {
                amax = fabsf(v);
                max  = v;
            }
        }
 
        const float d  = max / -8; // the largest-magnitude representable signed 4-bit value is -8
        // this means the max-magnitude value is mapped to -8 (0 after the unsigned shift below)
        // and every other value lands within [-8, 7] (0..15 unsigned)

        const float id = d ? 1.0f/d : 0.0f; // inverse of d for a faster multiply later;
        // if d is zero, all values in the block are zero and no scaling is needed
 
        // Store the scale factor as float16
        y[i].d = GGML_FP32_TO_FP16(d);
 
        // Quantize the values in the block
        for (int j = 0; j < qk/2; ++j) {
            // Process two values at a time because int4_t doesn't exist
            const float x0 = x[i*qk + 0    + j]*id;
            const float x1 = x[i*qk + qk/2 + j]*id;
            
            // Quantize to UNSIGNED 4-bit integers:
            // add 8 to shift from signed [-8, 7] to unsigned [0, 15],
            // plus 0.5 so the truncating cast rounds to nearest,
            // then clamp to at most 15
            const uint8_t xi0 = MIN(15, (int8_t)(x0 + 8.5f));
            const uint8_t xi1 = MIN(15, (int8_t)(x1 + 8.5f));
 
            y[i].qs[j]  = xi0;
            y[i].qs[j] |= xi1 << 4;
        }
    }
}
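To go the other way, each stored nibble is shifted back to the signed range and multiplied by the block scale: y[j] = d * (q[j] - 8). A minimal self-contained sketch (using a plain float scale instead of ggml's FP16, and a hypothetical block_q4_0_f32 struct standing in for block_q4_0):

```c
#include <stdint.h>

#define QK 32

// simplified stand-in for block_q4_0: plain float scale instead of ggml_fp16_t
typedef struct {
    float   d;          // block scale
    uint8_t qs[QK / 2]; // 16 bytes, two 4-bit values per byte
} block_q4_0_f32;

// recover the 32 floats of one block: y[j] = d * (q[j] - 8)
static void dequantize_block_q4_0(const block_q4_0_f32 *b, float *y) {
    for (int j = 0; j < QK / 2; ++j) {
        const int8_t q0 = (b->qs[j] & 0x0F) - 8; // low nibble  -> first half of block
        const int8_t q1 = (b->qs[j] >> 4)   - 8; // high nibble -> second half of block
        y[j]          = b->d * q0;
        y[j + QK / 2] = b->d * q1;
    }
}
```

Note how the nibble layout mirrors the quantizer above: the low nibbles hold elements 0..15 and the high nibbles hold elements 16..31 of the block.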

General implementation in llama.cpp

// implementation for type-0 k-quants
static float make_qx_quants(
    int n,                     // number of weights within one block
    int nmax,                  // max representable value for the bit width we're quantizing to
    const float * restrict x,  // input array
    int8_t * restrict L,       // output array of quantized int8_t values
    int rmse_type,             // selects the weighted error metric
    const float * restrict qw  // optional array assigning a different importance to each element during quantization
)
  • The function first finds the maximum absolute value in the input array x.

  • If all values are close to zero (less than GROUP_MAX_EPS), it sets all quantized values to 0 and returns.

  • It calculates an initial scale factor iscale from the maximum value and nmax (like abs-max quantization).

    • If rmse_type is 0, it stops here, performs a simple quantization using the initial scale and returns.
  • For other rmse_type values, it performs a more complex quantization process:

    • It calculates the quantization error as weighted sums of the input values and their quantized representations.

      • Looping over the block with for (int i = 0; i < n; ++i):
      • l is the quantized representation of x[i]
      • w is the importance factor (defined either by rmse_type or provided by qw[i])
        • rmse_type == 1: Weights error by square of input value (the default) i.e. x[i]*x[i]
        • rmse_type == 2: Uniform weighting
        • rmse_type == 3: Weights error by absolute value of input
        • rmse_type == 4: Weights error by square root of absolute value of input
      • sumlx += w*x[i]*l
      • suml2 += w*l*l
    • It computes an optimal scale factor that minimizes the quantization error.

      • This is float scale = suml2 ? sumlx/suml2 : 0
      • Proof:
        • Quantization model: let x[i] be the original values, l[i] the integer quantized values, and d the scale factor. The quantized representation of x[i] is d*l[i].
        • Weighted error: let w[i] be the weight for each value. The weighted squared error is E(d) = Σ_i w[i]*(x[i] - d*l[i])^2.
        • Minimizing error: to find the optimal scale, we minimize E with respect to d: dE/dd = -2*Σ_i w[i]*l[i]*(x[i] - d*l[i]) = 0.
        • Solving for the optimal scale: d = (Σ_i w[i]*x[i]*l[i]) / (Σ_i w[i]*l[i]^2).
        • Correspondence to code: Σ_i w[i]*x[i]*l[i] is calculated as sumlx and Σ_i w[i]*l[i]^2 as suml2, so scale = sumlx/suml2.
  • However, due to the discrete nature of quantization, this analytically derived optimal scale may not give the best result in practice, so the function goes on to refine the scale.

    • The function then tries to improve this scale by testing 18 nearby values (9 above and 9 below the initial scale).
      • For each tested scale:
        • It requantizes the input values.
        • Recalculates sumlx and suml2.
        • Checks if this new scale produces a better result (higher sumlx*sumlx / suml2).
        • If a better scale is found, it updates the quantized values in L and the scale variable.
  • The quantized values are stored in the L array, offset by nmax to make them non-negative.

  • The function returns the final scale factor, which can be used to dequantize the values later.