Definitions

  • k-quants is a hierarchical quantization scheme that works on blocks of weights.

Blocks

  • The weights of a given layer are split into “super” blocks, each containing a set of “sub” blocks.
    • Each sub-block finds its maximum absolute value, derives a scale factor from it, and quantizes its underlying values.
    • The super-block then takes the scale factors of all its sub-blocks and quantizes those as well, which requires the super-block to have a scale factor of its own.
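
A minimal sketch of this two-level structure, assuming (purely for illustration) 16 sub-blocks of 16 weights, 4-bit symmetric weight quants, and 6-bit sub-block scales; this is not the actual llama.cpp layout:

```cpp
#include <array>
#include <cmath>
#include <cstdint>

constexpr int SUB_BLOCKS = 16;  // sub-blocks per super-block
constexpr int SUB_SIZE   = 16;  // weights per sub-block (256 total)

struct SuperBlock {
    float d;                                       // scale for the sub-block scales
    std::array<uint8_t, SUB_BLOCKS> scales;        // quantized sub-block scales (6 bits used)
    std::array<int8_t, SUB_BLOCKS * SUB_SIZE> qs;  // quantized weights (4 bits used)
};

SuperBlock quantize_super_block(const float* w) {
    SuperBlock sb{};
    std::array<float, SUB_BLOCKS> sub_scale{};

    // 1. Each sub-block derives a scale from its maximum absolute value.
    for (int i = 0; i < SUB_BLOCKS; ++i) {
        float amax = 0.0f;
        for (int j = 0; j < SUB_SIZE; ++j)
            amax = std::fmax(amax, std::fabs(w[i * SUB_SIZE + j]));
        sub_scale[i] = amax / 7.0f;  // 4-bit signed levels in [-7, 7]
    }

    // 2. The super-block quantizes those scales to 6 bits [0, 63], which
    //    requires a scale factor of its own: d.
    float smax = 0.0f;
    for (float s : sub_scale) smax = std::fmax(smax, s);
    sb.d = smax / 63.0f;
    for (int i = 0; i < SUB_BLOCKS; ++i)
        sb.scales[i] = sb.d > 0 ? (uint8_t)std::lround(sub_scale[i] / sb.d) : 0;

    // 3. Weights are quantized against the *reconstructed* sub-scale, so the
    //    error introduced in step 2 is taken into account.
    for (int i = 0; i < SUB_BLOCKS; ++i) {
        float s = sb.d * sb.scales[i];
        for (int j = 0; j < SUB_SIZE; ++j) {
            float x = s > 0 ? w[i * SUB_SIZE + j] / s : 0.0f;
            x = std::fmin(std::fmax(x, -7.0f), 7.0f);
            sb.qs[i * SUB_SIZE + j] = (int8_t)std::lround(x);
        }
    }
    return sb;
}
```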

How the scale is found in llama.cpp
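
The simplest scheme, the absmax approach used by the “type-0” formats below, takes the scale straight from the block's largest-magnitude weight; the k-quants routines refine this by also trying nearby candidate scales and keeping the one with lower reconstruction error. A minimal sketch of the absmax starting point (the function name is illustrative, not llama.cpp's API):

```cpp
#include <cmath>

// Absmax scale for a block of n weights mapped to signed integer levels
// [-qmax, qmax]; illustrative only, not llama.cpp's actual routine.
float absmax_scale(const float* w, int n, int qmax) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = std::fmax(amax, std::fabs(w[i]));
    return amax / (float)qmax;  // then q = round(w / d) and w ~ d * q
}
```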

Quantization types

  • https://github.com/ggerganov/llama.cpp/pull/1684

  • In the existing ggml quantization types we have

    • “type-0” (Q4_0, Q5_0)
      • weights w are obtained from quants q using w = d * q, where d is the block scale.
      • (absmax quantization)
    • “type-1” (Q4_1, Q5_1)
      • weights are given by w = d * q + m, where m is the block minimum.
      • (asymmetric quantization)
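
Hedged sketches of both reconstructions, together with one common way to choose d and m for a “type-1” block (from the block's min/max range); names are illustrative, not ggml's API:

```cpp
#include <cmath>
#include <cstdint>

// "type-0": reconstruct with a scale only (symmetric around zero).
inline float dequant_type0(float d, int8_t q) { return d * q; }

// "type-1": reconstruct with a scale and a block minimum.
inline float dequant_type1(float d, float m, uint8_t q) { return d * q + m; }

// One common way to pick d and m for a type-1 block of n weights quantized
// to unsigned levels [0, qmax]: span the block's [min, max] range.
void quantize_type1(const float* w, int n, int qmax,
                    float* d, float* m, uint8_t* q) {
    float lo = w[0], hi = w[0];
    for (int i = 1; i < n; ++i) {
        lo = std::fmin(lo, w[i]);
        hi = std::fmax(hi, w[i]);
    }
    *m = lo;
    *d = (hi - lo) / (float)qmax;
    for (int i = 0; i < n; ++i) {
        q[i] = (*d > 0) ? (uint8_t)std::lround((w[i] - lo) / *d) : 0;
    }
}
```
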
  • In-depth example:

    • GGML_TYPE_Q2_K - “type-1” 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
      • Computing bits per weight:
        • 2 bits for each weight’s quant
        • 4 + 4 bits for the scale and min of each 16-weight block: 8 / 16 = 0.5 bpw
        • 16 bits for the scale of the super-block (always in fp16): 16 / 256 = 0.0625 bpw
        • which adds up to 2 + 0.5 + 0.0625 = 2.5625 bits per weight
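
As a quick check, the same arithmetic in code (the constants mirror the breakdown above):

```cpp
#include <cstdio>

// Reproduces the 2.5625 bpw figure for a 256-weight Q2_K super-block
// (16 sub-blocks of 16 weights), using the breakdown above.
int main() {
    const double weights    = 16.0 * 16.0;     // 256 weights per super-block
    const double quant_bits = 2.0 * weights;   // 2-bit quants -> 512 bits
    const double scale_bits = 16.0 * (4 + 4);  // 4-bit scale + 4-bit min per sub-block -> 128 bits
    const double super_bits = 16.0;            // one fp16 super-block scale -> 16 bits
    printf("%.4f bpw\n", (quant_bits + scale_bits + super_bits) / weights);  // 2.5625
    return 0;
}
```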