• https://www.youtube.com/watch?v=adA9AMu4_Kc

  • LLM serving is a mixture of memory-bound and compute-bound iterations

    • Weight-only quantization is good for memory-bound scenarios
      • but may not be sufficient for speedups in all cases (it reduces bytes moved, which only helps while bandwidth-bound)
    • It’s quite different from large-scale training, where the systems are distributed and you have to be careful not to become communication- or memory-bound
  • As batch size or sequence length grows, serving shifts from bandwidth-bound to compute-bound (see the sketch after this list)

    • The best-performing algorithm maximizes the achieved compute speed

Roofline plot of GEMM kernel performance on an H100

  • Simple Model - 3 Parameters

    • Rate of computation - TFLOP/s (property of the machine)
    • Rate of data movement - TB/s (property of the machine)
    • Arithmetic intensity - FLOP/byte (property of the algorithm)
      • For a GEMM C = A·B with A of shape (M, K) and B of shape (K, N), arithmetic intensity = compute/memory = 2·M·N·K / (b·(M·K + K·N + M·N)) FLOP/byte, where b is bytes per element
        • For square N×N matrices this is ≈ 2·N/(3·b), so the arithmetic intensity grows as the matrices get larger (see the sketch below)
  • 2 Regimes

    • Low arithmetic intensity = Bandwidth limited

      • The computer is not doing much work per byte; it’s mostly waiting for the next bytes to arrive, i.e. limited by the rate of data movement
      • If your algorithm does I FLOP/byte and your rate of data movement is B TB/s, then your performance is upper-bounded by I·B TFLOP/s
    • High arithmetic intensity = Compute limited

      • The bytes arrive faster than we can process them, i.e. we’re limited by the rate of computation
      • Upper-bounded by the machine’s peak rate of computation, in TFLOP/s