• Definition: how quickly you can move data from one memory location to another
    • CPU to GPU
    • one node to another
    • from CUDA global memory to CUDA shared memory (critical)
      • referred to as “bandwidth cost”

GPU memory

  • Divided into

    • DRAM (lots of it, but slow): the main memory pool
    • SRAM (small, but fast): sits right next to the compute units
      • the “cache memory” of GPUs
        • L1 cache, L2 cache, register memory
        • shared memory on the SMs (streaming multiprocessors)
  • DRAM is what shows up in nvidia-smi (see the snippet at the end of this section)

  • Every time we run a GPU kernel, data has to be moved from DRAM into SRAM, processed, and moved back to DRAM.

  • Location:

    • SRAM: On-chip, close to the processing units
    • HBM: Off-die but on the same package, physically closer to the GPU than traditional GDDR memory
  • Capacity:

    • SRAM: Smaller capacity (KB to MB range)
    • HBM: Larger capacity (GB range)
  • Speed:

    • SRAM: Lowest latency and the highest on-chip bandwidth, but very limited capacity
    • HBM: Higher latency than SRAM, but much higher bandwidth than traditional GDDR
  • Cost:

    • SRAM: Most expensive per bit
    • HBM: Less expensive than SRAM, but more expensive than GDDR
  • Use case:

    • SRAM: Cache and temporary storage for immediate processing needs
    • HBM: Main graphics memory for storing textures, frame buffers, and other large data sets
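For reference, the DRAM/HBM pool (the number nvidia-smi reports) can be read from Python; a quick sketch, assuming a visible CUDA device:

```python
import torch

# Total DRAM (HBM) on device 0; this is the figure nvidia-smi reports.
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 2**30:.1f} GiB of DRAM (HBM)")
```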

Memory bandwidth costs

  • torch.cos is memory-bound: you need to get the data from DRAM, perform a tiny bit of computation, and put the result back into DRAM.
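One rough way to see this empirically (a sketch, assuming a CUDA device and that the kernel really is bandwidth-limited): time torch.cos on a large tensor and convert the runtime into achieved bandwidth, which should land near the card's memory bandwidth rather than anywhere near its peak FLOP/s.

```python
import time
import torch

x = torch.randn(2**28, device="cuda")            # ~1 GiB of fp32 data
torch.cos(x)                                     # warm-up
torch.cuda.synchronize()

start = time.perf_counter()
y = torch.cos(x)                                 # one read + one write of the tensor
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

bytes_moved = 2 * x.numel() * x.element_size()   # read x, write y
print(f"achieved bandwidth ~ {bytes_moved / elapsed / 1e9:.0f} GB/s")
```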

Operator fusion

  • Operator fusion is simply fusing multiple operations into a single kernel, so we don’t move data back and forth to DRAM for every operation
    • x1 = x.cos(); x2 = x1.cos() run as two separate kernels is bad (two global reads and two global writes)
    • x2 = x.cos().cos() compiled into one fused kernel is good (one read, one write); see the sketch after this list
  • Existing compilers can often perform “simple” fusions - NVFuser and XLA being two examples.
  • Otherwise you write custom CUDA kernels.
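A minimal sketch of getting such a fusion without writing CUDA by hand, using torch.compile to fuse the two pointwise cos calls into a single generated kernel (so the tensor is read from and written to DRAM once):

```python
import torch

def cos_cos(x):
    return x.cos().cos()   # two pointwise ops; eager mode runs them as two kernels

# torch.compile (TorchInductor) can fuse the chain into one kernel.
fused_cos_cos = torch.compile(cos_cos)

x = torch.randn(2**24, device="cuda")
y = fused_cos_cos(x)
```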

Activation checkpointing

  • If you have a sequence of pointwise operators in training, then both the forwards pass and the backwards pass consist entirely of pointwise operators, and your runtime is essentially proportional to the amount of memory you’re reading and writing.
    • As such, the typical result of autograd is that every intermediate activation gets written out to DRAM in the forward pass and read back in the backward pass.
    • Instead, we can save only the input to the forward pass and recompute the intermediates during the backward pass, trading a little (cheap) recomputation for much less memory traffic.
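A minimal sketch using PyTorch's built-in torch.utils.checkpoint (the tensor size and the particular chain of cos calls are just illustrative):

```python
import torch
from torch.utils.checkpoint import checkpoint

def pointwise_chain(x):
    return x.cos().cos().cos()   # intermediates would normally all be saved for backward

x = torch.randn(2**20, device="cuda", requires_grad=True)

# Only the input x is stored; the intermediates are recomputed during the
# backward pass instead of being written to and re-read from DRAM.
y = checkpoint(pointwise_chain, x, use_reentrant=False)
y.sum().backward()
```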

Reasoning about Memory-Bandwidth Costs

  • an A100 has 1.5 terabytes/second of global memory bandwidth

  • Can perform 19.5 teraflops/second of compute for general (FP32, non-tensor-core) operations

  • Can perform 312 teraflops/second for matmuls (FP16/BF16 tensor cores)

  • Computation for a square matmul (multiplying two N x N matrices)

    • memory accesses: 3N^2 elements (read both inputs, write the output), times the bytes per element
    • FLOPs: 2N^3
    • time spent on memory: 3N^2 * bytes_per_element / 1.5 [TB/s]
    • time spent on computation: 2N^3 / 312 [TFLOP/s]
    • compute-bound: when the compute time exceeds the memory time; since FLOPs grow as N^3 but memory traffic only as N^2, large matmuls are compute-bound while small ones are memory-bound (see the sketch after this list)
  • Computation for a unary operator

    • If we’re using 32-bit floats (4 bytes), we can load roughly 400 billion numbers in one second (1.5 TB/s / 4 bytes ≈ 375 billion)
    • To perform even a simple unary operator (multiplying a tensor by 2), we also need to write the result back to memory, so each element costs about 8 bytes of DRAM traffic
    • Conclusion: for a unary operator not to be memory-bound, it needs to perform on the order of a hundred FLOPs per element (19.5 TFLOP/s divided by roughly 190 billion elements/second of read+write traffic); see the sketch after this list
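A minimal back-of-envelope sketch of both calculations, assuming the A100 spec numbers above and FP16 (2 bytes/element) inputs for the tensor-core matmul; the exact crossover points depend on dtype and on achievable (rather than peak) throughput, so treat the outputs as rough estimates.

```python
HBM_BW = 1.5e12        # bytes/s of global memory bandwidth (A100 spec)
MATMUL_FLOPS = 312e12  # FLOP/s for fp16/bf16 tensor-core matmuls (A100 spec)
FP32_FLOPS = 19.5e12   # FLOP/s for general (non-tensor-core) fp32 compute

# 1) Square matmul C = A @ B with N x N fp16 matrices (2 bytes per element).
def matmul_times(n, bytes_per_el=2):
    mem_time = 3 * n * n * bytes_per_el / HBM_BW  # read A, read B, write C
    compute_time = 2 * n**3 / MATMUL_FLOPS        # 2*N^3 FLOPs
    return mem_time, compute_time

for n in (128, 512, 1024, 4096):
    mem, comp = matmul_times(n)
    print(f"N={n:5d}: {'compute' if comp > mem else 'memory'}-bound "
          f"(memory {mem:.2e}s, compute {comp:.2e}s)")

# 2) Unary op on fp32: 4 bytes read + 4 bytes written per element.
elements_per_s = HBM_BW / 8               # ~187 billion elements/s of traffic
break_even = FP32_FLOPS / elements_per_s  # FLOPs per element before compute dominates
print(f"unary break-even: ~{break_even:.0f} FLOPs per element")  # ~100
```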