- Cache blocking and matrix blocking https://www.youtube.com/watch?v=G92BCtfTwOE
- why we use the transpose: for C = AB with row-major storage, reading a column of B strides through memory and causes many cache misses; if we transpose B first, its columns become contiguous rows, so the inner dot product reads both operands sequentially
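A minimal sketch of the transpose trick (matrix size and function names are illustrative, not from the video): copy B into its transpose once, then the inner loop of the dot product walks both operands row-wise.

```c
#include <assert.h>

#define N 4

/* C = A * B, but we first build Bt = transpose(B) so the inner
   dot-product reads Bt row-wise (contiguous memory) instead of
   striding down a column of B. */
void matmul_transposed(double A[N][N], double B[N][N], double C[N][N]) {
    double Bt[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            Bt[j][i] = B[i][j];            /* one-time transpose pass */

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * Bt[j][k]; /* both reads are sequential */
            C[i][j] = sum;
        }
}
```

The transpose itself costs one extra pass over B, which is cheap compared to the N dot products that each would otherwise stride through a column.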
- Cache pollution = data in the cache that we don't need
- Cache thrashing = continuously filling the cache with data we don't need, evicting data we do need
- If you do matmul the "classical way", i.e. take one row of A and dot-product it with every column of B, you end up with a lot of cache thrashing/pollution: you read the entire B matrix for every row of A, but B doesn't fit into cache, so it keeps getting evicted
- SOLUTION ⇒ blocking the matrix product
Blocking the matrix product
- Break the matmul into blocks
- For a 4×4 output block, we need rows 0-3 of A and columns 0-3 of B. If these rows and columns fit into cache, we get fewer cache misses
- You can then move on to the next block (keeping rows 0-3 of A cached and reading columns 4-7 of B) OR compute the blocks in parallel
- You get a lot of cache misses when you first read a new block, but once its rows/columns are cached, the remaining dot products are much faster to compute
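The blocked loop structure above can be sketched like this (block size, matrix size, and names are illustrative; a real implementation would tune BS so three BS×BS tiles fit in cache):

```c
#include <assert.h>

#define N  8
#define BS 4   /* block size; N must be a multiple of BS in this sketch */

/* Blocked (tiled) matmul: C = A * B computed tile by tile, so each
   BS x BS tile of A and B is reused from cache for BS*BS partial
   dot products before it can be evicted. */
void matmul_blocked(double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0;

    for (int ii = 0; ii < N; ii += BS)
      for (int jj = 0; jj < N; jj += BS)
        for (int kk = 0; kk < N; kk += BS)
          /* multiply the (ii,kk) tile of A by the (kk,jj) tile of B,
             accumulating into the (ii,jj) tile of C */
          for (int i = ii; i < ii + BS; i++)
            for (int j = jj; j < jj + BS; j++) {
              double sum = C[i][j];
              for (int k = kk; k < kk + BS; k++)
                sum += A[i][k] * B[k][j];
              C[i][j] = sum;
            }
}
```

The result is identical to the classical triple loop; only the order of the memory accesses changes, which is exactly why this is a cache optimization and not an algorithmic one.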