• Definition: how quickly you can move data from one memory location to another
    • CPU to GPU
    • one node to another
    • from CUDA global memory to CUDA shared memory (critical)
      • referred to as “bandwidth cost”

GPU memory

  • Divided into

    • DRAM (lots of it, but slow): the main memory pool
    • SRAM (small, but fast): sits right next to the compute units
      • the “cache memory” of GPUs
        • L1 cache, L2 cache, register memory
        • shared memory on the SMs (streaming multiprocessors)
  • DRAM is what shows up in nvidia-smi (see the snippet at the end of this section)

  • Every time we run a GPU kernel, data has to be moved from DRAM into SRAM, processed, and moved back to DRAM.

  • Location:

    • SRAM: On-chip, close to the processing units
    • HBM: Off-die but on the same package, physically closer to the GPU than traditional GDDR memory
  • Capacity:

    • SRAM: Smaller capacity (KB to MB range)
    • HBM: Larger capacity (GB range)
  • Speed:

    • SRAM: Lowest latency and the highest on-chip bandwidth, but very limited capacity
    • HBM: Higher latency than SRAM, but much higher bandwidth than traditional GDDR
  • Cost:

    • SRAM: Most expensive per bit
    • HBM: Less expensive than SRAM, but more expensive than GDDR
  • Use case:

    • SRAM: Cache and temporary storage for immediate processing needs
    • HBM: Main graphics memory for storing textures, frame buffers, and other large data sets
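For reference, the DRAM/HBM pool (the number nvidia-smi reports) can be read from Python; a quick sketch, assuming a visible CUDA device:

```python
import torch

# Total DRAM (HBM) on device 0; this is the figure nvidia-smi reports.
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 2**30:.1f} GiB of DRAM (HBM)")
```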

Memory bandwidth costs

  • torch.cos is memory-bound: you need to get the data from DRAM, perform a tiny bit of computation, and put the result back into DRAM.
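One rough way to see this empirically (a sketch, assuming a CUDA device and that the kernel really is bandwidth-limited): time torch.cos on a large tensor and convert the runtime into achieved bandwidth, which should land near the card's memory bandwidth rather than anywhere near its peak FLOP/s.

```python
import time
import torch

x = torch.randn(2**28, device="cuda")            # ~1 GiB of fp32 data
torch.cos(x)                                     # warm-up
torch.cuda.synchronize()

start = time.perf_counter()
y = torch.cos(x)                                 # one read + one write of the tensor
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

bytes_moved = 2 * x.numel() * x.element_size()   # read x, write y
print(f"achieved bandwidth ~ {bytes_moved / elapsed / 1e9:.0f} GB/s")
```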

Operator fusion

  • Operator fusion is simply fusing multiple operations into a single kernel, so we don’t move data back and forth to DRAM for every operation
    • x1 = x.cos(); x2 = x1.cos() run as two separate kernels is bad (two global reads and two global writes)
    • x2 = x.cos().cos() compiled into one fused kernel is good (one read, one write); see the sketch after this list
  • Existing compilers can often perform “simple” fusions - NVFuser and XLA being two examples.
  • Otherwise you write custom CUDA kernels.
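A minimal sketch of getting such a fusion without writing CUDA by hand, using torch.compile to fuse the two pointwise cos calls into a single generated kernel (so the tensor is read from and written to DRAM once):

```python
import torch

def cos_cos(x):
    return x.cos().cos()   # two pointwise ops; eager mode runs them as two kernels

# torch.compile (TorchInductor) can fuse the chain into one kernel.
fused_cos_cos = torch.compile(cos_cos)

x = torch.randn(2**24, device="cuda")
y = fused_cos_cos(x)
```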

Activation checkpointing

  • If you have a sequence of pointwise operators in training, then both the forwards pass and the backwards pass consist entirely of pointwise operators, and your runtime is essentially proportional to the amount of memory you’re reading and writing.
    • As such, the typical result of autograd is that every intermediate activation gets written out to DRAM in the forward pass and read back in the backward pass.
    • Instead, we can save only the input to the forward pass and recompute the intermediates during the backward pass, trading a little (cheap) recomputation for much less memory traffic.
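A minimal sketch using PyTorch's built-in torch.utils.checkpoint (the tensor size and the particular chain of cos calls are just illustrative):

```python
import torch
from torch.utils.checkpoint import checkpoint

def pointwise_chain(x):
    return x.cos().cos().cos()   # intermediates would normally all be saved for backward

x = torch.randn(2**20, device="cuda", requires_grad=True)

# Only the input x is stored; the intermediates are recomputed during the
# backward pass instead of being written to and re-read from DRAM.
y = checkpoint(pointwise_chain, x, use_reentrant=False)
y.sum().backward()
```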

Reasoning about Memory-Bandwidth Costs

  • an A100 has 1.5 terabytes/second of global memory bandwidth

  • Can perform 19.5 teraflops/second of compute for general (FP32, non-tensor-core) operations

  • Can perform 312 teraflops/second for matmuls (FP16/BF16 tensor cores)

  • Computation for a square matmul (multiplying two N x N matrices)

    • memory accesses: 3N^2 elements (read both inputs, write the output), times the bytes per element
    • FLOPs: 2N^3
    • time spent on memory: 3N^2 * bytes_per_element / 1.5 [TB/s]
    • time spent on computation: 2N^3 / 312 [TFLOP/s]
    • compute-bound: when the compute time exceeds the memory time; since FLOPs grow as N^3 but memory traffic only as N^2, large matmuls are compute-bound while small ones are memory-bound (see the sketch after this list)
  • Computation for a unary operator

    • If we’re using 32-bit floats (4 bytes), we can load roughly 400 billion numbers in one second (1.5 TB/s / 4 bytes ≈ 375 billion)
    • To perform even a simple unary operator (multiplying a tensor by 2), we also need to write the result back to memory, so each element costs about 8 bytes of DRAM traffic
    • Conclusion: for a unary operator not to be memory-bound, it needs to perform on the order of a hundred FLOPs per element (19.5 TFLOP/s divided by roughly 190 billion elements/second of read+write traffic); see the sketch after this list
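A minimal back-of-envelope sketch of both calculations, assuming the A100 spec numbers above and FP16 (2 bytes/element) inputs for the tensor-core matmul; the exact crossover points depend on dtype and on achievable (rather than peak) throughput, so treat the outputs as rough estimates.

```python
HBM_BW = 1.5e12        # bytes/s of global memory bandwidth (A100 spec)
MATMUL_FLOPS = 312e12  # FLOP/s for fp16/bf16 tensor-core matmuls (A100 spec)
FP32_FLOPS = 19.5e12   # FLOP/s for general (non-tensor-core) fp32 compute

# 1) Square matmul C = A @ B with N x N fp16 matrices (2 bytes per element).
def matmul_times(n, bytes_per_el=2):
    mem_time = 3 * n * n * bytes_per_el / HBM_BW  # read A, read B, write C
    compute_time = 2 * n**3 / MATMUL_FLOPS        # 2*N^3 FLOPs
    return mem_time, compute_time

for n in (128, 512, 1024, 4096):
    mem, comp = matmul_times(n)
    print(f"N={n:5d}: {'compute' if comp > mem else 'memory'}-bound "
          f"(memory {mem:.2e}s, compute {comp:.2e}s)")

# 2) Unary op on fp32: 4 bytes read + 4 bytes written per element.
elements_per_s = HBM_BW / 8               # ~187 billion elements/s of traffic
break_even = FP32_FLOPS / elements_per_s  # FLOPs per element before compute dominates
print(f"unary break-even: ~{break_even:.0f} FLOPs per element")  # ~100
```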