LLM serving is a mixture of memory-bound and compute-bound iterations
- Weight-only quantization is good for memory-bound scenarios
- but may not be sufficient for speedups in all cases
- It’s quite different from large-scale training, where, since the systems are distributed, you have to be careful not to become communication- or memory-bound
As batch size or sequence length grows, serving goes from bandwidth-bound to compute-bound
- The best-performing algorithm maximizes compute throughput
Roofline plot of GEMM kernel performance on an H100
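The shift can be sketched with the arithmetic intensity of a single decode matmul: multiplying a (b × d) activation by a (d × d) weight, intensity grows roughly linearly in batch size b until it saturates. A minimal sketch, assuming fp16 (2 bytes/element) and counting each tensor moved exactly once; `decode_intensity` and the dimensions are illustrative, not from the source:

```python
# Sketch: why serving shifts from bandwidth- to compute-bound as batch grows.
# A decode step multiplies a (b x d) activation by a (d x d) weight; for small
# b the weight read dominates the bytes moved, so intensity ~ b flop/byte.

def decode_intensity(batch: int, d: int, bytes_per_elem: int = 2) -> float:
    flops = 2 * batch * d * d                                   # one MAC = 2 flops
    bytes_moved = bytes_per_elem * (batch * d + d * d + batch * d)  # read x, W; write y
    return flops / bytes_moved

# Intensity scales ~linearly with batch while b << d (d = 8192 is illustrative):
for b in (1, 8, 64, 512):
    print(b, round(decode_intensity(b, 8192), 1))
```

At batch 1 the intensity is about 1 flop/byte (pure GEMV, deeply bandwidth-bound); at batch 512 it is in the hundreds, pushing the kernel toward the compute-bound regime.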
Simple Model - 3 Parameters
- Rate of computation - TFlop/s (property of the machine)
- Rate of data movements - TB/s (property of the machine)
- Arithmetic intensity - Flop/Byte (property of the algorithm)
- For a GEMM of shapes (M×K)·(K×N) in fp16, arithmetic intensity = compute/memory = 2MNK / 2(MK + KN + MN) flop/byte
- The arithmetic intensity grows as the matrices get larger
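The three parameters above can be sketched in a few lines. A sketch assuming fp16 (2 bytes per element) and counting each matrix read or written exactly once; `gemm_arithmetic_intensity` is an illustrative helper, not from the source:

```python
# Sketch: arithmetic intensity of an (M x K) @ (K x N) GEMM in fp16.

def gemm_arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    flops = 2 * m * n * k                                    # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)   # read A, B; write C
    return flops / bytes_moved

# Intensity grows roughly linearly with matrix size (n/3 for square fp16 GEMMs):
for size in (256, 1024, 4096):
    print(size, round(gemm_arithmetic_intensity(size, size, size), 1))
```

For a square n×n GEMM the ratio simplifies to n/3 flop/byte, which makes the "intensity grows as the matrices get larger" claim concrete.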
2 Regimes
Low arithmetic intensity = Bandwidth limited
- The computer is not doing much work per byte; it’s mostly waiting for the next bytes to arrive, i.e. it is limited by the rate of data movement
- If your algorithm does I flop/byte and your rate of data movement is B TB/s, then your performance is upper-bounded by I × B TFlop/s
High arithmetic intensity = Compute limited
- The bytes arrive faster than we can process them; we’re limited by the rate of computation
- Performance is upper-bounded by the machine’s rate of computation in TFlop/s
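The two regimes combine into the roofline bound: attainable throughput is the minimum of peak compute and intensity × bandwidth. A sketch with illustrative, roughly H100-class numbers (the peak and bandwidth figures are assumptions for the example, not official specs):

```python
# Sketch of the roofline bound. Hardware numbers are assumed, H100-SXM-class.
PEAK_TFLOPS = 989.0   # assumed fp16 dense peak, TFlop/s
BANDWIDTH_TBS = 3.35  # assumed HBM bandwidth, TB/s

def attainable_tflops(intensity_flop_per_byte: float) -> float:
    # Bandwidth-limited below the ridge point, compute-limited above it.
    return min(PEAK_TFLOPS, intensity_flop_per_byte * BANDWIDTH_TBS)

ridge = PEAK_TFLOPS / BANDWIDTH_TBS   # intensity where the two regimes meet
print(f"ridge point ~ {ridge:.0f} flop/byte")
print(attainable_tflops(10))    # low intensity: bandwidth-limited, 10 x 3.35
print(attainable_tflops(1000))  # high intensity: capped at peak compute
```

With these assumed numbers the ridge point sits near 300 flop/byte: GEMMs below it (small batch, decode) are bandwidth-limited, those above it are compute-limited.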