
https://lilianweng.github.io/posts/2023-01-10-inference-optimization/

Quantization-aware training fuses the quantization operation into the pre-training or fine-tuning process. It learns model weights in low-bit representation directly and leads to better performance at the cost of additional training time and computation.
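A minimal sketch of the core operation in quantization-aware training is "fake quantization": weights are rounded to a low-bit grid in the forward pass but kept in float so training can proceed (gradients typically bypass the rounding via the straight-through estimator). The symmetric per-tensor scheme below is an illustrative assumption, not the specific method of any paper:

```python
import numpy as np

def fake_quantize(w, num_bits=8):
    """Quantize-dequantize: simulates low-bit weights in full precision.

    Symmetric per-tensor quantization. In QAT this op sits in the forward
    pass, while the backward pass treats the rounding as identity
    (straight-through estimator).
    """
    qmax = 2 ** (num_bits - 1) - 1                      # e.g. 127 for int8
    scale = np.abs(w).max() / qmax                      # map max |w| onto the grid
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)   # integer levels
    return q * scale                                    # back to float for the forward pass

w = np.array([0.9, -0.31, 0.02, -1.2])
w_q = fake_quantize(w, num_bits=4)   # each weight snapped to a 4-bit grid
```

Because the rounding happens inside every forward pass, the model learns weights that remain accurate after quantization, rather than being quantized once after training.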

Fine-tuning: The most straightforward approach is to fine-tune the model after quantization on a training dataset that is the same as or representative of the pre-training dataset. The training objective can be the same as the one for pre-training (e.g. NLL/MLM in general language model training) or specific to a downstream task that we care about (e.g. cross entropy for classification).
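For the classification case, the objective is ordinary cross entropy, except the logits are produced with the quantized weights. A small sketch (the 3-class head, the coarse rounding used as a stand-in quantizer, and all array shapes are illustrative assumptions):

```python
import numpy as np

def cross_entropy(logits, target):
    """NLL of the target class under a softmax over the logits."""
    z = logits - logits.max()                    # subtract max for numerical stability
    return -(z[target] - np.log(np.exp(z).sum()))

# Hypothetical 3-class head: logits = x @ W, with W held in low precision.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W = rng.normal(size=(4, 3))
W_q = np.round(W * 10) / 10                      # stand-in for a real quantizer
loss = cross_entropy(x @ W_q, target=2)          # fine-tuning loss on quantized weights
```

Minimizing this loss with respect to W (through the quantizer) adapts the weights to the downstream task while keeping them representable in low precision.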

Distillation: Another approach is to consider the full-precision model as the teacher and the lower-precision model as the student, and then optimize the low-precision model with a distillation loss. Distillation usually doesn't need the original training dataset; e.g. the Wikipedia dataset is a good choice, and even random tokens can give a decent performance gain.
The Layer-by-layer Knowledge Distillation (LKD; Yao et al. 2022) method quantizes the network layer by layer and uses each layer's original, unquantized version as its teacher. Given the same inputs, LKD minimizes the MSE between the layer output computed with the original weights and the output computed with the quantized weights.
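The per-layer objective can be sketched as below, treating the layer as a plain matrix multiply; the batch size, layer width, and the coarse rounding stand-in for the quantizer are illustrative assumptions, not LKD's actual configuration:

```python
import numpy as np

def lkd_layer_loss(x, W, W_q):
    """LKD-style per-layer loss: MSE between the full-precision layer
    output x @ W (teacher) and the quantized layer output x @ W_q
    (student), computed on the same inputs x."""
    return np.mean((x @ W - x @ W_q) ** 2)

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 16))        # a batch of inputs to this layer
W = rng.normal(size=(16, 16))       # original full-precision weights
W_q = np.round(W * 4) / 4           # stand-in quantizer: snap to a coarse grid
loss = lkd_layer_loss(x, W, W_q)    # minimized w.r.t. the quantized weights
```

Because only one layer is optimized at a time against its own full-precision counterpart, the procedure never needs labels or end-to-end backpropagation through the whole network.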