Symmetric quantization

  • The quantized value for zero is the full-precision zero, i.e. the quantized range is symmetric around zero

Absolute maximum (absmax) quantization

  • Given a list of values, we take the highest absolute value, α = max(|x|), as the range to perform the linear mapping
    • e.g. FP32 → INT8
  • 127 = max representable value of your format (here INT8)
  • scale factor: s = 127 / α
  • quantized value: x_q = round(s · x)
  • quantization error = x − x_q / s (in original precision)
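The absmax scheme above can be sketched in NumPy (a minimal illustration; the helper names are mine, not a library API):

```python
import numpy as np

def absmax_quantize(x: np.ndarray):
    """Symmetric absmax quantization of FP32 values to INT8 (a sketch)."""
    alpha = np.max(np.abs(x))                  # highest absolute value
    scale = 127 / alpha                        # 127 = max INT8 value
    x_q = np.round(x * scale).astype(np.int8)  # linear map + round
    return x_q, scale

def absmax_dequantize(x_q: np.ndarray, scale) -> np.ndarray:
    return x_q.astype(np.float32) / scale      # back to (approximate) FP32

x = np.array([-1.5, 0.0, 0.6, 2.0], dtype=np.float32)
x_q, scale = absmax_quantize(x)                # 0.0 maps to 0, ±alpha to ±127
error = x - absmax_dequantize(x_q, scale)      # quantization error in FP32
```

Note that zero maps exactly to zero, and the error per element is at most half a quantization step, i.e. 0.5 / s.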

Asymmetric quantization

  • It maps the minimum (β) and maximum (α) values from the float range to the minimum and maximum values of the quantized range.
  • Example with zeropoint quantization (FP32 → INT8)
    • 127 = max representable value of the INT8 format
    • −128 = min representable value of the INT8 format
    • scale factor: s = (127 − (−128)) / (α − β) = 255 / (α − β)
    • zeropoint: z = round(−s · β) − 128
    • quantized value: x_q = round(s · x + z)
      • e.g. the float value 0 gets mapped to z
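A minimal NumPy sketch of the zeropoint scheme (the function names are illustrative; it assumes the float range straddles the values of interest):

```python
import numpy as np

def zeropoint_quantize(x: np.ndarray):
    """Asymmetric (zeropoint) quantization of FP32 values to INT8 (a sketch)."""
    alpha, beta = x.max(), x.min()             # float range [beta, alpha]
    scale = 255 / (alpha - beta)               # 255 = 127 - (-128)
    zeropoint = int(np.round(-scale * beta)) - 128
    x_q = np.clip(np.round(x * scale) + zeropoint, -128, 127).astype(np.int8)
    return x_q, scale, zeropoint

def zeropoint_dequantize(x_q, scale, zeropoint) -> np.ndarray:
    return (x_q.astype(np.float32) - zeropoint) / scale

x = np.array([-0.8, 0.0, 1.2, 3.2], dtype=np.float32)
x_q, scale, z = zeropoint_quantize(x)          # 0.0 maps to the zeropoint z
```

Unlike absmax, the minimum float value lands on −128 and the maximum on 127, so the whole INT8 range is used even when the float range is not centered on zero.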

Outliers

Clipping

  • If your vector has an outlier, then using “naive” quantization can lead to most values being squeezed into just a few of the available quantized values, because the outlier stretches the range that must be covered

  • We need to clip the outliers: pick a range [−α, α] in FP32 and map any value outside it to ±α before quantizing

  • How do you choose the clipping range?

    • For weights and biases

      • Manually choosing a percentile of the input distribution
      • Optimizing the mean squared error (MSE) between the original and quantized weights
      • Minimizing the entropy (KL divergence) between the original and quantized values
    • For activations

      • Unlike weights, activations vary with each input fed into the model during inference, which makes it challenging to quantize them accurately.
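The MSE-based option for choosing a clipping range can be sketched as a simple grid search (NumPy; the function names and candidate grid are illustrative, not from a specific library):

```python
import numpy as np

def quantize_with_clip(w: np.ndarray, clip: float) -> np.ndarray:
    """Clip to [-clip, clip], symmetric INT8 quantize, then dequantize."""
    w_clipped = np.clip(w, -clip, clip)
    scale = 127 / clip                        # 127 = max INT8 value
    w_q = np.round(w_clipped * scale).astype(np.int8)
    return w_q.astype(np.float32) / scale     # back to float for comparison

def best_clip_by_mse(w: np.ndarray, n_candidates: int = 50) -> float:
    """Grid-search a clipping threshold that minimizes the MSE
    between original and quantized weights (an illustrative sketch)."""
    max_abs = float(np.max(np.abs(w)))
    candidates = np.linspace(0.1 * max_abs, max_abs, n_candidates)
    mses = [float(np.mean((w - quantize_with_clip(w, c)) ** 2))
            for c in candidates]
    return float(candidates[int(np.argmin(mses))])

# A synthetic weight tensor with one large outlier:
w = np.concatenate([np.random.default_rng(0).normal(0, 1, 1000), [20.0]])
clip = best_clip_by_mse(w)
```

How aggressively the optimum clips depends on the tensor and the bit width, but by construction the chosen threshold is never worse in MSE than not clipping at all, since the full absmax range is one of the candidates.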