• https://www.youtube.com/watch?v=adA9AMu4_Kc

  • LLM serving is a mixture of memory-bound and compute-bound iterations

    • Weight-only quantization is good for memory-bound scenarios
      • but may not be sufficient for speedups in all cases (it reduces bytes moved, which only helps while bandwidth-bound)
    • It’s quite different from large-scale training, where the systems are distributed and you have to be careful not to become communication- or memory-bound
  • As batch size or sequence length grows, serving shifts from bandwidth-bound to compute-bound (see the sketch after this list)

    • The best-performing algorithm maximizes the achieved compute speed

Roofline plot of GEMM kernel performance on an H100

  • Simple Model - 3 Parameters

    • Rate of computation - TFLOP/s (property of the machine)
    • Rate of data movement - TB/s (property of the machine)
    • Arithmetic intensity - FLOP/byte (property of the algorithm)
      • For a GEMM C = A·B with A of shape (M, K) and B of shape (K, N), arithmetic intensity = compute/memory = 2·M·N·K / (b·(M·K + K·N + M·N)) FLOP/byte, where b is bytes per element
        • For square N×N matrices this is ≈ 2·N/(3·b), so the arithmetic intensity grows as the matrices get larger (see the sketch below)
  • 2 Regimes

    • Low arithmetic intensity = Bandwidth limited

      • The computer is not doing much work per byte; it’s mostly waiting for the next bytes to arrive, i.e. limited by the rate of data movement
      • If your algorithm does I FLOP/byte and your rate of data movement is B TB/s, then your performance is upper-bounded by I·B TFLOP/s
    • High arithmetic intensity = Compute limited

      • The bytes arrive faster than we can process them, i.e. we’re limited by the rate of computation
      • Upper-bounded by the machine’s peak rate of computation, in TFLOP/s