Summary
There are three kinds of memory on the GPU:
- HBM/Global memory - This can be thought of as the GPU's equivalent of CPU RAM. It is the slowest and largest memory available on the GPU.
  - For reference, an H100 SXM has 80 GB of HBM with ~3 TB/s of bandwidth (i.e. it can transfer about 3 TB per second either to or from HBM)
  - This is where the model is loaded when we do model.to(device='cuda:0')
- L2 Cache - Faster than HBM but much smaller. This is shared among all the SMs.
  - For reference, an H100 SXM has 50 MB of L2 cache (lol, in comparison to 80 GB of HBM) with ~12 TB/s of bandwidth
- Shared memory - The fastest and smallest memory available on the GPU.
  - Every SM has its own shared memory (on an H100, up to 228 KB per SM), and all the cores executing instructions in an SM have access to it.
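The HBM numbers above make for a useful back-of-envelope calculation: if a workload has to stream every byte of HBM once per step (roughly what happens during memory-bandwidth-bound LLM inference, where all the weights are read for each token), the bandwidth alone caps the throughput. A quick sketch using the figures quoted above:

```python
# Back-of-envelope, using the H100 SXM figures from the notes above:
# 80 GB of HBM, ~3 TB/s of HBM bandwidth.
HBM_BYTES = 80e9       # 80 GB
HBM_BANDWIDTH = 3e12   # ~3 TB/s

# Minimum time to read every byte in HBM once, e.g. streaming all model
# weights for one forward pass of a bandwidth-bound model.
seconds_per_pass = HBM_BYTES / HBM_BANDWIDTH
print(f"{seconds_per_pass * 1e3:.1f} ms per full pass over HBM")  # 26.7 ms

# Upper bound on passes per second (ignoring compute and cache reuse) -
# e.g. a rough tokens/s ceiling for a model whose weights fill HBM.
passes_per_second = HBM_BANDWIDTH / HBM_BYTES
print(f"{passes_per_second:.1f} passes/s at most")  # 37.5
```

This is only a bandwidth ceiling, not a prediction: real kernels overlap compute with memory traffic and reuse data in the L2 cache and shared memory, which is exactly why the faster (but smaller) levels of the hierarchy matter.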