• Cache blocking and matrix blocking https://www.youtube.com/watch?v=G92BCtfTwOE

    • why we use transpose (for C = AB we transpose B, so reading what was a column of B becomes reading a contiguous row of Bᵀ; in a row-major layout this avoids many cache misses)
  • Cache pollution = data in the cache that we don’t need

  • Cache thrashing = continuously filling the cache with data we don’t need, evicting data we do need

  • If you do matrix mul the “classical” way, i.e. take one row of A and dot-product it with every column of B, you end up with quite a bit of cache thrashing/pollution: you read the entire B matrix, it doesn’t fit into cache, so its lines keep getting evicted

    • SOLUTION ⇒ blocking the matrix product
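
The transpose trick above can be sketched in C. This is a minimal illustration, assuming row-major storage; the function name `matmul_transposed` and the size `N` are mine, not from the video:

```c
/* Sketch: C = A * B, but B is transposed first so the inner loop
   reads both operands sequentially in memory (row-major layout). */
enum { N = 4 };

void matmul_transposed(double A[N][N], double B[N][N], double C[N][N]) {
    double Bt[N][N]; /* Bt[j][k] = B[k][j] */
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            Bt[j][k] = B[k][j];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * Bt[j][k]; /* both reads are contiguous */
            C[i][j] = sum;
        }
}
```

The extra O(N²) transpose pays for itself because the O(N³) inner loop then walks both arrays stride-1.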

Blocking the matrix product

  • Break the matmul into blocks
    • For the block illustrated below, we need to read rows 0-3 of A and columns 0-3 of B. If these rows and columns fit into cache, we get fewer cache misses
    • You can then move on to the next block (keeping rows 0-3 and reading columns 4-7) OR do every block in parallel
      • You get a lot of cache misses when you first read a new block, but once its rows/columns are cached, the remaining dot products are much faster to compute
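
A minimal C sketch of the blocked product described above; the tile size `BS` and matrix size `M` are illustrative (in practice you pick `BS` so that three BS×BS tiles fit in cache):

```c
#include <string.h>

enum { M = 8, BS = 4 }; /* BS must divide M in this simple sketch */

void matmul_blocked(double A[M][M], double B[M][M], double C[M][M]) {
    memset(C, 0, sizeof(double) * M * M);
    for (int ii = 0; ii < M; ii += BS)
        for (int jj = 0; jj < M; jj += BS)
            for (int kk = 0; kk < M; kk += BS)
                /* multiply the (ii,kk) tile of A by the (kk,jj) tile of B,
                   accumulating into the (ii,jj) tile of C */
                for (int i = ii; i < ii + BS; i++)
                    for (int j = jj; j < jj + BS; j++) {
                        double sum = C[i][j];
                        for (int k = kk; k < kk + BS; k++)
                            sum += A[i][k] * B[k][j];
                        C[i][j] = sum;
                    }
}
```

Each tile is touched repeatedly while it is hot in cache, instead of streaming entire rows and columns of the full matrices.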

Going further and loading only blocks