GPU concepts
- Modal's GPU Glossary: https://modal.com/gpu-glossary/readme
- Explanations of GPUs (Tim Dettmers): https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/#Memory_Bandwidth
- Understanding NVIDIA GPU Performance: Utilization vs. Saturation (2023)
GPU programming
Beginner
- Getting Started With CUDA for Python Programmers (Jeremy Howard)
- Triton programming for Mamba: https://srush.github.io/annotated-mamba/hard.html
- Colfax tutorials (from the FA3 authors): https://research.colfax-intl.com
Advanced
- Flash Attention derived and coded from first principles with Triton (Python)
- Walkthrough of the Parallel Scan with CUDA
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
- Flash Attention implementation that is just 100 lines of code and 30% faster
- ThunderKittens (TK): a simple DSL embedded in CUDA for expressing the key technical ideas behind AI kernels; it aims for clean, easy-to-understand code that still maximizes GPU utilization across many kinds of kernels.
- Kernel for fast 2:4 sparsification (50% zeros) that is an order of magnitude faster than alternatives; when applied to the weights, it makes linear layers roughly 30% faster over the combined forward and backward passes (see the 2:4 sketch after this list).
- Fused RMSNorm in Triton: https://github.com/pytorch-labs/applied-ai/blob/main/kernels/triton/training/rms_norm/fused_rms_norm.py (a minimal forward-pass sketch also follows this list)
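To make the 2:4 entry above concrete, here is a plain PyTorch sketch of what 2:4 ("semi-structured") sparsification means: in every contiguous group of four weights, the two smallest-magnitude values are zeroed. This only shows the selection step; the linked kernel also packs the result for the GPU's sparse tensor cores and runs far faster than an eager implementation like this. The function name `sparsify_2_4` is illustrative, not taken from the linked code.

```python
import torch

def sparsify_2_4(w: torch.Tensor) -> torch.Tensor:
    # Keep the 2 largest-magnitude values in every contiguous group of 4
    # along the last dimension; zero the other 2 (50% structured sparsity).
    rows, cols = w.shape
    groups = w.reshape(rows, cols // 4, 4)
    keep = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, keep, True)
    return (groups * mask).reshape(rows, cols)

w = torch.randn(8, 16)
w24 = sparsify_2_4(w)
# every group of 4 now holds at most 2 nonzeros
assert ((w24.reshape(8, -1, 4) != 0).sum(-1) <= 2).all()
```

On recent PyTorch builds, a tensor sparsified this way can be packed for the sparse tensor cores via `torch.sparse.to_sparse_semi_structured`; check the current docs for the dtype and shape constraints.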
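And for the fused RMSNorm link, a minimal Triton sketch of the forward pass (one program per row: a single reduction for the root-mean-square, then normalize, scale, and write out). It assumes a contiguous 2-D input normalized over its last dimension; the linked pytorch-labs kernel is more general and also covers the backward pass.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def rms_norm_fwd(x_ptr, w_ptr, y_ptr, n_cols, stride, eps, BLOCK: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * stride + cols, mask=mask, other=0.0).to(tl.float32)
    # rms = sqrt(mean(x^2) + eps): one reduction, no mean subtraction (unlike LayerNorm)
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    y = x / rms * w
    tl.store(y_ptr + row * stride + cols, y, mask=mask)

def rms_norm(x, weight, eps=1e-6):
    # x: contiguous 2-D CUDA tensor, normalized over its last dimension
    n_rows, n_cols = x.shape
    y = torch.empty_like(x)
    BLOCK = triton.next_power_of_2(n_cols)
    rms_norm_fwd[(n_rows,)](x, weight, y, n_cols, x.stride(0), eps, BLOCK=BLOCK)
    return y
```

Fusing the square-sum reduction, normalization, and scaling into one kernel is what saves time: a chain of elementwise PyTorch ops would make several extra round trips through global memory.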
Parallel programming (scans, …)
- Parallelizing Complex Scans and Reductions: a method for automatically extracting parallel prefix programs from sequential loops, even in the presence of complicated conditional statements (see the scan sketch after this list).
- Parallelizing non-linear sequential models over the sequence length: https://x.com/christopher/status/1811406837675163998?s=46
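The common trick behind these scan-based approaches, sketched below in plain NumPy under simplifying assumptions (a first-order linear recurrence, no conditionals): the sequential loop h_t = a_t * h_{t-1} + b_t is recast as a prefix scan over (a, b) pairs with an associative composition operator, which Hillis-Steele or Blelloch style scans evaluate in O(log T) parallel steps.

```python
import numpy as np

def sequential(a, b):
    # the loop we want to parallelize: h_t = a_t * h_{t-1} + b_t, with h_{-1} = 0
    h, out = 0.0, []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return np.array(out)

def scan(a, b):
    # Hillis-Steele inclusive scan over the affine maps x -> a*x + b, using the
    # associative composition (a1, b1) then (a2, b2) = (a1*a2, a2*b1 + b2).
    # Each doubling step is purely element-wise, hence parallelizable.
    a, b = a.astype(float).copy(), b.astype(float).copy()
    step = 1
    while step < len(a):
        a_prev = np.concatenate([np.ones(step), a[:-step]])   # pad with the identity map
        b_prev = np.concatenate([np.zeros(step), b[:-step]])
        a, b = a_prev * a, a * b_prev + b
        step *= 2
    return b   # b_t is the composed prefix map applied to h_{-1} = 0

rng = np.random.default_rng(0)
a, b = rng.standard_normal(8), rng.standard_normal(8)
assert np.allclose(sequential(a, b), scan(a, b))
```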
nvmath-python
- nvmath-python is a new NVIDIA library that provides Pythonic access to accelerated math operations from CUDA-X Math. It is currently in beta and includes operations for linear algebra and fast Fourier transforms; a hedged usage sketch follows.
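A hedged usage sketch: the module paths below (`nvmath.linalg.advanced.matmul`, `nvmath.fft.fft`) follow the beta documentation and may change, so treat them as assumptions rather than a stable API. nvmath-python operates directly on GPU arrays such as CuPy arrays or PyTorch tensors.

```python
import cupy as cp
import nvmath  # beta; API subject to change

a = cp.random.rand(2048, 2048, dtype=cp.float32)
b = cp.random.rand(2048, 2048, dtype=cp.float32)

# cuBLASLt-backed matrix multiply (stateless function API)
c = nvmath.linalg.advanced.matmul(a, b)

# cuFFT-backed FFT; fft expects a complex-typed input array
x = cp.random.rand(1 << 20).astype(cp.complex64)
y = nvmath.fft.fft(x)
```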
Sequence Parallelism - Long context
- Linear Attention Sequence Parallelism (LASP): scales sequence length up to 4096K tokens on 1B-parameter models using 128 A100 80GB GPUs, 8x longer than existing SP methods while being significantly faster.
- RingAttention (a single-process sketch of the ring pattern follows this list)
- StripedAttention
- BurstAttention
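For intuition about the ring-style methods above, here is a minimal single-process NumPy sketch (illustrative only, not any paper's reference implementation): the sequence is split into blocks, one per device; queries stay local while the K/V blocks rotate around a ring, and a numerically stable online-softmax accumulation means the full attention matrix is never materialized. Real implementations overlap the point-to-point communication with compute and add causal masking.

```python
import numpy as np

def ring_attention(q_blocks, k_blocks, v_blocks):
    n_dev = len(q_blocks)
    d = q_blocks[0].shape[-1]
    # per-"device" running output, row max, and row sum for the online softmax
    out = [np.zeros_like(q) for q in q_blocks]
    row_max = [np.full(q.shape[0], -np.inf) for q in q_blocks]
    row_sum = [np.zeros(q.shape[0]) for q in q_blocks]

    k_cur, v_cur = list(k_blocks), list(v_blocks)
    for _ in range(n_dev):                        # n_dev ring steps
        for i in range(n_dev):                    # each "device" works in parallel
            s = q_blocks[i] @ k_cur[i].T / np.sqrt(d)       # local score block
            new_max = np.maximum(row_max[i], s.max(axis=-1))
            scale = np.exp(row_max[i] - new_max)            # rescale old accumulators
            p = np.exp(s - new_max[:, None])
            out[i] = out[i] * scale[:, None] + p @ v_cur[i]
            row_sum[i] = row_sum[i] * scale + p.sum(axis=-1)
            row_max[i] = new_max
        # rotate the K/V blocks one hop around the ring (stands in for P2P sends)
        k_cur = k_cur[1:] + k_cur[:1]
        v_cur = v_cur[1:] + v_cur[:1]
    return [o / l[:, None] for o, l in zip(out, row_sum)]

# sanity check against dense attention
rng = np.random.default_rng(0)
n_dev, blk, d = 4, 8, 16
q, k, v = (rng.standard_normal((n_dev * blk, d)) for _ in range(3))
s = q @ k.T / np.sqrt(d)
ref = (np.exp(s - s.max(-1, keepdims=True)) /
       np.exp(s - s.max(-1, keepdims=True)).sum(-1, keepdims=True)) @ v
out = np.concatenate(ring_attention(np.split(q, n_dev), np.split(k, n_dev), np.split(v, n_dev)))
assert np.allclose(out, ref, atol=1e-6)
```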