Abstract

  • You need to think about matrix shapes both in terms of:
    • Is it a nice power-of-2 multiple, e.g. divisible by 32?
      • so that it fits nicely in cache/shared memory
      • so that it fits nicely into tiles (e.g. a 257x128 matrix requires two 256x128 tiles instead of one)
    • is the total number of tiles just below or equal to the number of SMs (or a multiple of it)?
      • e.g. an A100 has 108 SMs, and for a 1792x1792 matrix a 256x128 tile grid has dimensions 1792/256 = 7 and 1792/128 = 14. That gives us 7 x 14 = 98 tiles.
  • While it’s super important that the sequence length, hidden size, and various other hyperparameters are divisible by high powers of 2 (64, 128 and higher) to achieve the highest performance, the micro batch size alignment usually has little to no impact on performance, because in most models the batch dimension is flattened with the sequence length dimension during the compute (see the padding sketch after this list)
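
A minimal sketch of the alignment idea (plain Python; the function name and example dimensions are illustrative, not from the text) — rounding a dimension up to the nearest multiple of, say, 128:

```python
def pad_to_multiple(x: int, multiple: int = 128) -> int:
    """Round a dimension (e.g. sequence length or hidden size) up to the nearest multiple."""
    return ((x + multiple - 1) // multiple) * multiple

# Illustrative values only
for dim in (250, 257, 1792, 1793):
    print(dim, "->", pad_to_multiple(dim))  # 250 -> 256, 257 -> 384, 1792 -> 1792, 1793 -> 1920
```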

What Shapes Do Matrix Multiplications Like?

  • Sources: 1, 2, 3

Memory Layout of Tiling

  • Let’s think about what our matrix’s memory layout looks like when our row size is a multiple of the cache line (for illustration, pretend a cache line holds 4 fp32 numbers, i.e. 16 bytes, instead of the actual 128 bytes). We’ll show 3-4 cache lines

  • Each matrix row starts on a cache line boundary! If we need the second row, we can just load the 3 cache lines that the yellow elements are a part of

  • What happens if we increase the number of elements per row from 12 to 13?

  • With the unaligned layout, things are much messier. For example, in order to load the first 4 green elements, we must issue 2 loads: one that gets the last blue element plus the first 3 green elements, and one that gets the 4th green element (illustrated in the sketch below)
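
A small sketch of the counting behind that (plain Python; assumes the 4-element cache line used in the illustration above):

```python
def cache_lines_touched(start_elem: int, n_elems: int, line_elems: int = 4) -> int:
    """Cache lines touched when loading n_elems consecutive elements starting at flat index start_elem."""
    end = start_elem + n_elems - 1
    return end // line_elems - start_elem // line_elems + 1

# Aligned layout: 12 elements per row, so every row starts on a cache line boundary.
# Loading the first 4 elements of row 1 touches a single cache line.
print(cache_lines_touched(1 * 12, 4))  # 1
# Unaligned layout: 13 elements per row, so row 1 starts mid-line.
# The same 4-element load now straddles 2 cache lines (i.e. 2 loads instead of 1).
print(cache_lines_touched(1 * 13, 4))  # 2
```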

GPU matrix mul implementation

  • GPUs implement matrix mul by partitioning the output matrix into tiles, which are then assigned to thread blocks.
  • Tile size, in this guide, usually refers to the dimensions of these tiles. Each thread block computes its output tile by stepping through the K dimension in tiles, loading the required values from the A and B matrices, and multiplying and accumulating them into the output.
    • where A is M x K, B is K x N, and the output C is M x N (the tiling loop is sketched below)
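
A rough CPU-side sketch of that scheme (NumPy; tile sizes and matrix shapes are illustrative, and on a real GPU the output tiles of course run in parallel across thread blocks rather than in a Python loop):

```python
import numpy as np

def tiled_matmul(A, B, tile_m=256, tile_n=128, tile_k=64):
    """Each (tile_m x tile_n) output tile is what a thread block would own."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for m in range(0, M, tile_m):          # one (m, n) iteration ~ one thread block's output tile
        for n in range(0, N, tile_n):
            acc = np.zeros((min(tile_m, M - m), min(tile_n, N - n)), dtype=A.dtype)
            for k in range(0, K, tile_k):  # step through the K dimension in tiles
                acc += A[m:m + tile_m, k:k + tile_k] @ B[k:k + tile_k, n:n + tile_n]
            C[m:m + tile_m, n:n + tile_n] = acc
    return C

A = np.random.rand(512, 256).astype(np.float32)
B = np.random.rand(256, 384).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, rtol=1e-3)
```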

Quantization

Main idea

Let’s say we have N parallel tasks (which each take a second) and N CPUs.  
Q: How long does it take to perform all tasks?  
A: 1 second

Q: What about if we have (N+1) parallel tasks, and N CPUs?  
A: 2 seconds(!) Now, one CPU must perform two tasks, taking a total of 2 seconds.
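
The arithmetic behind the two answers, as a tiny sketch (illustrative Python):

```python
import math

def completion_time_s(num_tasks: int, num_cpus: int, task_s: float = 1.0) -> float:
    """Every CPU works through its tasks back to back, so total time is ceil(tasks / cpus) rounds."""
    return math.ceil(num_tasks / num_cpus) * task_s

N = 8  # any N works; the value is illustrative
print(completion_time_s(N, N))      # 1.0 second  -- one task per CPU
print(completion_time_s(N + 1, N))  # 2.0 seconds -- one CPU must run a second task
```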

Wave quantization

  • Wave quantization is the example above, except the CPUs become SMs and the tasks become thread blocks (see the formula sketch after this list).
  • Here the total number of tiles is quantized to the number of multiprocessors on the GPU
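
In formula form (a small sketch; it assumes one thread block per SM, as in the A100 example below):

```python
import math

def num_waves(num_tiles: int, num_sms: int) -> int:
    """Tiles execute in 'waves' of at most num_sms blocks at a time (one block per SM)."""
    return math.ceil(num_tiles / num_sms)

def last_wave_occupancy(num_tiles: int, num_sms: int) -> float:
    """Fraction of SMs that still have work during the final wave -- what wave quantization wastes."""
    remainder = num_tiles % num_sms
    return (remainder if remainder else num_sms) / num_sms
```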

Applied example

  • An NVIDIA A100 GPU has 108 SMs; in the particular case of 256x128 thread block tiles, it can execute one thread block per SM, leading to a wave size of 108 tiles that can execute simultaneously. Thus, GPU utilization will be highest when the number of tiles is an integer multiple of 108 or just below.

    • For a square matrix multiplication where M = N:
    • If M = N = 1792, our tile grid has dimensions 1792/256 = 7 and 1792/128 = 14. That gives us 7 x 14 = 98 tiles. Since an A100 has 108 SMs, that’s still one wave
    • If M = N = 1793, we get 8 x 15 = 120 tiles, or 2 waves! (reproduced in the sketch below)
  • As your matrix size increases, the total number of tiles/blocks increases. When this crosses a multiple of the # of SMs, your perf drops since you need to execute an additional “wave”.
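
A quick sketch reproducing the tile/wave arithmetic above (plain Python; 256x128 tiles and 108 SMs as in the A100 example):

```python
import math

def tile_grid(M: int, N: int, tile_m: int = 256, tile_n: int = 128):
    """Dimensions of the thread-block tile grid covering an M x N output matrix."""
    return math.ceil(M / tile_m), math.ceil(N / tile_n)

def tiles_and_waves(M: int, N: int, num_sms: int = 108):
    grid_m, grid_n = tile_grid(M, N)
    tiles = grid_m * grid_n
    return tiles, math.ceil(tiles / num_sms)

print(tiles_and_waves(1792, 1792))  # (98, 1): 7 x 14 tiles fit in a single wave on 108 SMs
print(tiles_and_waves(1793, 1793))  # (120, 2): 8 x 15 tiles spill into a second wave
```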