Summary

There are three kinds of memory on the GPU:

  1. HBM/Global memory - This can be thought of as the GPU's equivalent of CPU RAM. It is the slowest and largest memory available on the GPU.
    1. For reference, H100 SXM has 80GB of HBM with 3 TB/s of bandwidth (i.e. it can transfer 3 TB per second either to or from HBM)
    2. This is where the model is loaded when we do model.to(device='cuda:0')
  2. L2 Cache - Faster than HBM but limited in size. This is shared among all the SMs.
    1. For reference, H100 SXM has 50 MB (lol, compared to HBM's 80 GB) of L2 cache with ~12 TB/s of bandwidth
  3. Shared memory - The fastest and smallest memory available on the GPU.
    1. Every SM has its own shared memory, and all the cores executing instructions on that SM have access to it.
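To build intuition for why the HBM bandwidth number matters, here is a minimal back-of-the-envelope sketch in Python. The 7B-parameter fp16 model is a made-up example; the 3 TB/s figure is the H100 SXM bandwidth quoted above. It estimates the lower-bound time just to stream all the weights from HBM once, which is why memory-bound workloads care so much about bandwidth:

```python
def hbm_read_time_ms(n_params, bytes_per_param=2, bandwidth_bytes_per_s=3e12):
    """Lower-bound time (ms) to stream all weights from HBM once.

    bytes_per_param=2 assumes fp16; bandwidth defaults to the
    H100 SXM's ~3 TB/s HBM bandwidth mentioned above.
    """
    total_bytes = n_params * bytes_per_param
    return total_bytes / bandwidth_bytes_per_s * 1e3

# Hypothetical 7B-parameter model in fp16 (14 GB of weights):
print(round(hbm_read_time_ms(7e9), 2))  # ~4.67 ms per full pass over the weights
```

So even before any compute happens, a single pass over 14 GB of weights costs a few milliseconds of pure memory traffic; this kind of estimate is where roofline-style reasoning about GPU kernels starts.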