• non-linear quantization

  • Most existing llama.cpp quantization types use a linear (affine) mapping between quants and de-quantized weights (i.e., x = a * q or x = a * q + b)

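    • A minimal sketch of the linear case, assuming a Q8_0-style block with one scale per 32 int8 quants; the struct and field names below are illustrative, not the exact llama.cpp layout
// Illustrative linear de-quantization: x = a * q (one scale per block, no offset).
// The struct is a sketch, not the exact llama.cpp block_q8_0 layout.
#include <stdint.h>

typedef struct {
    float  d;       // per-block scale factor a
    int8_t qs[32];  // quantized weights q
} block_linear;

void dequant_linear(const block_linear *b, float *y) {
    for (int j = 0; j < 32; ++j) {
        y[j] = b->d * b->qs[j];  // every reconstructed value lies on a uniform grid
    }
}
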
  • The i-quants instead hard-code a non-uniform mapping from quant index to value via a look-up table

    • Example for a 4-bit quant (IQ4_NL)
static const int8_t kvalues_iq4nl[16] = {-127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113};
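    • The gaps between adjacent table entries are not constant: they shrink from 23 at the negative end down to 11 around zero and grow back to 24 at the positive end, which is exactly the non-linearity; a quick self-contained check (not part of llama.cpp):
// Print the gap between adjacent table entries to show the non-uniform spacing.
#include <stdint.h>
#include <stdio.h>

static const int8_t kvalues_iq4nl[16] = {-127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113};

int main(void) {
    for (int i = 1; i < 16; ++i) {
        printf("%d ", kvalues_iq4nl[i] - kvalues_iq4nl[i - 1]);
    }
    printf("\n");  // prints: 23 21 18 16 14 13 12 11 12 12 13 15 16 20 24
    return 0;
}
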
  • Quantization scales each weight into the table's int8 range and picks the nearest table entry
float al = id*xb[j];                            // id is the inverse of the block scale: map the weight into the table's int8 range
int l = best_index_int8(16, kvalues_iq4nl, al); // binary search for the table entry closest to al (sketch below)
Lb[j] = l;                                      // the quantized representation is the index into the 16-entry table
float q = kvalues_iq4nl[l];                     // the reconstructed (pre-scale) value is the int8 table entry
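  • best_index_int8 itself is not shown above; below is a minimal sketch of a nearest-value binary search over the sorted table, written from the description rather than copied from llama.cpp
// Minimal sketch of a nearest-value search over the sorted 16-entry table.
// Illustrative only; the real llama.cpp best_index_int8 may differ in detail.
#include <stdint.h>

int best_index_int8(int n, const int8_t *values, float x) {
    if (x <= values[0])     return 0;
    if (x >= values[n - 1]) return n - 1;
    int lo = 0, hi = n - 1;
    while (hi - lo > 1) {                  // binary search for the bracketing pair
        int mid = (lo + hi) / 2;
        if (x < values[mid]) hi = mid; else lo = mid;
    }
    // pick whichever of the two bracketing entries is closer to x
    return (x - values[lo] < values[hi] - x) ? lo : hi;
}
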
  • Dequantization is just a table look-up followed by scaling
// qs = the packed quantized values (two 4-bit table indices per byte)
// dl = the block scaling factor
y[j+ 0] = dl * kvalues_iq4nl[qs[j] & 0xf];  // low nibble -> first half of the block
y[j+16] = dl * kvalues_iq4nl[qs[j] >> 4];   // high nibble -> second half of the block
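  • Putting it together: a self-contained sketch of one IQ4_NL-style 32-weight block round trip; the block layout, helper names, and the simple max|x|/127 scale choice are assumptions for illustration, not the exact llama.cpp quantization routine
// Sketch: quantize a 32-weight block to 4-bit table indices plus one float scale,
// then de-quantize it back. The scale here is just max|x|/127 for simplicity.
#include <math.h>
#include <stdint.h>
#include <stdio.h>

static const int8_t kvalues_iq4nl[16] = {-127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113};

static int nearest_index(float x) {               // linear scan is fine for 16 entries
    int best = 0;
    float best_err = fabsf(x - kvalues_iq4nl[0]);
    for (int i = 1; i < 16; ++i) {
        float err = fabsf(x - kvalues_iq4nl[i]);
        if (err < best_err) { best_err = err; best = i; }
    }
    return best;
}

static void quantize_block(const float *x, uint8_t *qs, float *dl) {
    float amax = 0.0f;
    for (int j = 0; j < 32; ++j) amax = fmaxf(amax, fabsf(x[j]));
    float d  = amax / 127.0f;                     // block scale (simplified choice)
    float id = d ? 1.0f / d : 0.0f;               // inverse scale
    for (int j = 0; j < 16; ++j) {
        int lo = nearest_index(id * x[j +  0]);   // index for the first half
        int hi = nearest_index(id * x[j + 16]);   // index for the second half
        qs[j] = (uint8_t)(lo | (hi << 4));        // pack two 4-bit indices per byte
    }
    *dl = d;
}

static void dequantize_block(const uint8_t *qs, float dl, float *y) {
    for (int j = 0; j < 16; ++j) {
        y[j +  0] = dl * kvalues_iq4nl[qs[j] & 0xf];
        y[j + 16] = dl * kvalues_iq4nl[qs[j] >> 4];
    }
}

int main(void) {
    float x[32], y[32], dl;
    uint8_t qs[16];
    for (int j = 0; j < 32; ++j) x[j] = sinf((float)j);  // dummy weights
    quantize_block(x, qs, &dl);
    dequantize_block(qs, dl, y);
    for (int j = 0; j < 4; ++j) printf("%8.4f -> %8.4f\n", x[j], y[j]);
    return 0;
}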