• https://lilianweng.github.io/posts/2023-01-10-inference-optimization/

  • Quantization-aware training fuses the quantization operation into the pre-training or fine-tuning process.

    • It learns model weights directly in a low-bit representation and yields better performance, at the cost of additional training time and computation (see the fake-quantization sketch after this list).
  • Finetuning: The most straightforward approach is to fine-tune the model after quantization on a training dataset that is the same as or representative of the pre-training dataset.

    • The training objective can be the same as the one for pre-training (e.g. NLL/MLM in general language model training) or specific to a downstream task that we care about (e.g. cross-entropy for classification); both options are illustrated in the fine-tuning sketch after this list.
  • Distillation: Another approach is to consider the full-precision model as the teacher and the lower-precision model as the student, and then optimize the low-precision model with distillation loss.

    • Distillation usually doesn’t need to use the original dataset; e.g. the Wikipedia dataset is a good choice, and even random tokens can give a decent performance gain.
    • The Layer-by-layer Knowledge Distillation (LKD; Yao et al. 2022) method quantizes the network layer by layer and uses each layer's original, unquantized version as the teacher. Given the same inputs, LKD minimizes the MSE between the output of the original layer weights and the output of the quantized layer weights (see the layer-wise distillation sketch after this list).
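
A minimal sketch of the quantization-aware training idea, assuming PyTorch; the `FakeQuantLinear` module, its bit-width, and the training snippet are illustrative assumptions, not from the original post. Weights are "fake-quantized" in the forward pass so the loss sees low-bit behavior, while a straight-through estimator lets full-precision gradients update the underlying weights.

```python
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Module):
    """Hypothetical linear layer with per-tensor symmetric weight fake-quantization."""
    def __init__(self, in_features, out_features, n_bits=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.n_bits = n_bits

    def fake_quant(self, w):
        qmax = 2 ** (self.n_bits - 1) - 1
        scale = w.abs().max() / qmax
        w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
        # Straight-through estimator: forward uses w_q, backward treats the
        # rounding as identity so gradients reach the full-precision weights.
        return w + (w_q - w).detach()

    def forward(self, x):
        return nn.functional.linear(x, self.fake_quant(self.weight), self.bias)

# Training proceeds as usual; the quantization op is fused into the graph.
layer = FakeQuantLinear(16, 4, n_bits=4)
opt = torch.optim.SGD(layer.parameters(), lr=1e-2)
x, y = torch.randn(32, 16), torch.randn(32, 4)
opt.zero_grad()
loss = nn.functional.mse_loss(layer(x), y)
loss.backward()
opt.step()
```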
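
A rough sketch of post-quantization fine-tuning, again assuming PyTorch; `quantized_model`, `finetune_step`, and the batch keys are hypothetical placeholders. It only illustrates the choice of objective described above: a pre-training-style next-token NLL versus a downstream cross-entropy over task labels.

```python
import torch
import torch.nn.functional as F

def finetune_step(quantized_model, batch, optimizer, task="lm"):
    """One fine-tuning step for a model whose weights are already (fake-)quantized."""
    optimizer.zero_grad()
    if task == "lm":
        # Pre-training-style objective: next-token negative log-likelihood.
        logits = quantized_model(batch["input_ids"])          # (B, T, vocab)
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            batch["input_ids"][:, 1:].reshape(-1),
        )
    else:
        # Downstream objective: cross-entropy over class labels.
        logits = quantized_model(batch["input_ids"])          # (B, num_classes)
        loss = F.cross_entropy(logits, batch["labels"])
    loss.backward()
    optimizer.step()
    return loss.item()
```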
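
A sketch in the spirit of LKD, assuming PyTorch and that the quantized layer applies differentiable fake quantization (as in the first sketch); `distill_layer` and `calib_inputs` are illustrative names. Each quantized layer is optimized to match the output of its full-precision counterpart on the same inputs with an MSE loss, one layer at a time, so no labels or original training data are required.

```python
import torch
import torch.nn.functional as F

def distill_layer(fp_layer, quant_layer, calib_inputs, steps=100, lr=1e-4):
    """Optimize one quantized layer to mimic its full-precision teacher.

    calib_inputs: a tensor of calibration activations feeding this layer.
    quant_layer: assumed to fake-quantize its weights with a straight-through
    estimator, so its parameters receive gradients.
    """
    fp_layer.eval()
    optimizer = torch.optim.Adam(quant_layer.parameters(), lr=lr)
    for _ in range(steps):
        x = calib_inputs[torch.randint(len(calib_inputs), (8,))]  # random mini-batch
        with torch.no_grad():
            teacher_out = fp_layer(x)       # output with original layer weights
        student_out = quant_layer(x)        # output with quantized layer weights
        loss = F.mse_loss(student_out, teacher_out)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return quant_layer
```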