• non-linear quantization

  • Most existing llama.cpp quantization types use a linear (affine) mapping between quants and de-quantized weights (i.e., x = a * q or x = a * q + b)

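    • A minimal sketch of the linear case, assuming a Q8_0-style block with one scale per 32 int8 quants; the struct and field names below are illustrative, not the exact llama.cpp layout
// Illustrative linear de-quantization: x = a * q (one scale per block, no offset).
// The struct is a sketch, not the exact llama.cpp block_q8_0 layout.
#include <stdint.h>

typedef struct {
    float  d;       // per-block scale factor a
    int8_t qs[32];  // quantized weights q
} block_linear;

void dequant_linear(const block_linear *b, float *y) {
    for (int j = 0; j < 32; ++j) {
        y[j] = b->d * b->qs[j];  // every reconstructed value lies on a uniform grid
    }
}
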
  • The i-quants instead hard-code a non-uniform mapping from quant index to value via a look-up table

    • Example for a 4-bit quant (IQ4_NL)
static const int8_t kvalues_iq4nl[16] = {-127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113};
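    • The gaps between adjacent table entries are not constant: they shrink from 23 at the negative end down to 11 around zero and grow back to 24 at the positive end, which is exactly the non-linearity; a quick self-contained check (not part of llama.cpp):
// Print the gap between adjacent table entries to show the non-uniform spacing.
#include <stdint.h>
#include <stdio.h>

static const int8_t kvalues_iq4nl[16] = {-127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113};

int main(void) {
    for (int i = 1; i < 16; ++i) {
        printf("%d ", kvalues_iq4nl[i] - kvalues_iq4nl[i - 1]);
    }
    printf("\n");  // prints: 23 21 18 16 14 13 12 11 12 12 13 15 16 20 24
    return 0;
}
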
  • Quantization scales each weight into the table's int8 range and picks the nearest table entry
float al = id*xb[j];                            // id is the inverse of the block scale: map the weight into the table's int8 range
int l = best_index_int8(16, kvalues_iq4nl, al); // binary search for the table entry closest to al (sketch below)
Lb[j] = l;                                      // the quantized representation is the index into the 16-entry table
float q = kvalues_iq4nl[l];                     // the reconstructed (pre-scale) value is the int8 table entry
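  • best_index_int8 itself is not shown above; below is a minimal sketch of a nearest-value binary search over the sorted table, written from the description rather than copied from llama.cpp
// Minimal sketch of a nearest-value search over the sorted 16-entry table.
// Illustrative only; the real llama.cpp best_index_int8 may differ in detail.
#include <stdint.h>

int best_index_int8(int n, const int8_t *values, float x) {
    if (x <= values[0])     return 0;
    if (x >= values[n - 1]) return n - 1;
    int lo = 0, hi = n - 1;
    while (hi - lo > 1) {                  // binary search for the bracketing pair
        int mid = (lo + hi) / 2;
        if (x < values[mid]) hi = mid; else lo = mid;
    }
    // pick whichever of the two bracketing entries is closer to x
    return (x - values[lo] < values[hi] - x) ? lo : hi;
}
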
  • Dequantization is just a table look-up followed by scaling
// qs = the packed quantized values (two 4-bit table indices per byte)
// dl = the block scaling factor
y[j+ 0] = dl * kvalues_iq4nl[qs[j] & 0xf];  // low nibble -> first half of the block
y[j+16] = dl * kvalues_iq4nl[qs[j] >> 4];   // high nibble -> second half of the block
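  • Putting it together: a self-contained sketch of one IQ4_NL-style 32-weight block round trip; the block layout, helper names, and the simple max|x|/127 scale choice are assumptions for illustration, not the exact llama.cpp quantization routine
// Sketch: quantize a 32-weight block to 4-bit table indices plus one float scale,
// then de-quantize it back. The scale here is just max|x|/127 for simplicity.
#include <math.h>
#include <stdint.h>
#include <stdio.h>

static const int8_t kvalues_iq4nl[16] = {-127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113};

static int nearest_index(float x) {               // linear scan is fine for 16 entries
    int best = 0;
    float best_err = fabsf(x - kvalues_iq4nl[0]);
    for (int i = 1; i < 16; ++i) {
        float err = fabsf(x - kvalues_iq4nl[i]);
        if (err < best_err) { best_err = err; best = i; }
    }
    return best;
}

static void quantize_block(const float *x, uint8_t *qs, float *dl) {
    float amax = 0.0f;
    for (int j = 0; j < 32; ++j) amax = fmaxf(amax, fabsf(x[j]));
    float d  = amax / 127.0f;                     // block scale (simplified choice)
    float id = d ? 1.0f / d : 0.0f;               // inverse scale
    for (int j = 0; j < 16; ++j) {
        int lo = nearest_index(id * x[j +  0]);   // index for the first half
        int hi = nearest_index(id * x[j + 16]);   // index for the second half
        qs[j] = (uint8_t)(lo | (hi << 4));        // pack two 4-bit indices per byte
    }
    *dl = d;
}

static void dequantize_block(const uint8_t *qs, float dl, float *y) {
    for (int j = 0; j < 16; ++j) {
        y[j +  0] = dl * kvalues_iq4nl[qs[j] & 0xf];
        y[j + 16] = dl * kvalues_iq4nl[qs[j] >> 4];
    }
}

int main(void) {
    float x[32], y[32], dl;
    uint8_t qs[16];
    for (int j = 0; j < 32; ++j) x[j] = sinf((float)j);  // dummy weights
    quantize_block(x, qs, &dl);
    dequantize_block(qs, dl, y);
    for (int j = 0; j < 4; ++j) printf("%8.4f -> %8.4f\n", x[j], y[j]);
    return 0;
}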