- Non-linear quantization
- Most existing llama.cpp quantization types use a linear mapping between quants and de-quantized weights (i.e., x = a * q or x = a * q + b)
- In the case of iquants, a different (non-linear) mapping is hardcoded as a look-up table
  - Quantization is done with this mapping
  - Dequantization is just a table lookup
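
The difference between the two mappings can be sketched in a few lines of Python. This is a minimal illustration, not llama.cpp's implementation: the 3-bit `TABLE` values, the nearest-entry search in `quantize_lut`, and the max-magnitude scale in `block_scale` are all assumptions made for the example.

```python
# Hypothetical 3-bit codebook, denser near zero. Illustrative values only,
# not a table taken from llama.cpp.
TABLE = [-1.0, -0.6, -0.35, -0.15, 0.0, 0.2, 0.45, 1.0]


def dequantize_linear(q, a, b=0.0):
    """Linear mapping: x = a * q + b."""
    return a * q + b


def dequantize_lut(q, a):
    """Non-linear mapping: the quant q is an index into TABLE, scaled by a."""
    return a * TABLE[q]


def quantize_lut(x, a):
    """Return the index whose scaled table entry lies nearest to x (assumed search)."""
    return min(range(len(TABLE)), key=lambda i: abs(a * TABLE[i] - x))


def block_scale(weights):
    """Assumed scale choice: map the largest magnitude in the block to 1.0."""
    return max(abs(w) for w in weights) or 1.0


weights = [0.02, -0.31, 0.9, -0.05]
a = block_scale(weights)
quants = [quantize_lut(w, a) for w in weights]
restored = [dequantize_lut(q, a) for q in quants]
print(quants)    # [4, 2, 7, 4]
print(restored)
```

With a linear mapping the representable values are evenly spaced, whereas the table can place more of them near zero, where most weights typically lie. Dequantization stays cheap either way: one index into the table and one multiply.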