GPU concepts
- Modal's GPU Glossary: https://modal.com/gpu-glossary/readme
- Explanations of GPUs (Tim Dettmers): https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/#Memory_Bandwidth
- Understanding NVIDIA GPU Performance: Utilization vs. Saturation (2023)
GPU programming
Beginner
- Getting Started With CUDA for Python Programmers (Jeremy Howard)
- Triton programming for Mamba: https://srush.github.io/annotated-mamba/hard.html
- Colfax tutorials (from the FA3 authors): https://research.colfax-intl.com
Advanced
- Flash Attention derived and coded from first principles with Triton (Python)
- Walkthrough of the Parallel Scan with CUDA
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
- Flash Attention implementation that is just 100 lines of code and 30% faster
- ThunderKittens (TK): a simple DSL embedded in CUDA for expressing the key technical ideas behind AI kernels; it aims for clean, easy-to-understand code that still maximizes GPU utilization across many kinds of kernels.
- Kernel for fast 2:4 sparsification (50% zeros) that is an order of magnitude faster than alternatives; when applied to the weights, it makes linear layers roughly 30% faster over the combined forward and backward passes (see the 2:4 sketch after this list).
- Fused RMSNorm in Triton: https://github.com/pytorch-labs/applied-ai/blob/main/kernels/triton/training/rms_norm/fused_rms_norm.py (a minimal forward-pass sketch also follows this list)
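To make the 2:4 entry above concrete, here is a plain PyTorch sketch of what 2:4 ("semi-structured") sparsification means: in every contiguous group of four weights, the two smallest-magnitude values are zeroed. This only shows the selection step; the linked kernel also packs the result for the GPU's sparse tensor cores and runs far faster than an eager implementation like this. The function name `sparsify_2_4` is illustrative, not taken from the linked code.

```python
import torch

def sparsify_2_4(w: torch.Tensor) -> torch.Tensor:
    # Keep the 2 largest-magnitude values in every contiguous group of 4
    # along the last dimension; zero the other 2 (50% structured sparsity).
    rows, cols = w.shape
    groups = w.reshape(rows, cols // 4, 4)
    keep = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, keep, True)
    return (groups * mask).reshape(rows, cols)

w = torch.randn(8, 16)
w24 = sparsify_2_4(w)
# every group of 4 now holds at most 2 nonzeros
assert ((w24.reshape(8, -1, 4) != 0).sum(-1) <= 2).all()
```

On recent PyTorch builds, a tensor sparsified this way can be packed for the sparse tensor cores via `torch.sparse.to_sparse_semi_structured`; check the current docs for the dtype and shape constraints.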
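And for the fused RMSNorm link, a minimal Triton sketch of the forward pass (one program per row: a single reduction for the root-mean-square, then normalize, scale, and write out). It assumes a contiguous 2-D input normalized over its last dimension; the linked pytorch-labs kernel is more general and also covers the backward pass.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def rms_norm_fwd(x_ptr, w_ptr, y_ptr, n_cols, stride, eps, BLOCK: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * stride + cols, mask=mask, other=0.0).to(tl.float32)
    # rms = sqrt(mean(x^2) + eps): one reduction, no mean subtraction (unlike LayerNorm)
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    y = x / rms * w
    tl.store(y_ptr + row * stride + cols, y, mask=mask)

def rms_norm(x, weight, eps=1e-6):
    # x: contiguous 2-D CUDA tensor, normalized over its last dimension
    n_rows, n_cols = x.shape
    y = torch.empty_like(x)
    BLOCK = triton.next_power_of_2(n_cols)
    rms_norm_fwd[(n_rows,)](x, weight, y, n_cols, x.stride(0), eps, BLOCK=BLOCK)
    return y
```

Fusing the square-sum reduction, normalization, and scaling into one kernel is what saves time: a chain of elementwise PyTorch ops would make several extra round trips through global memory.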
Parallel programming (scans, …)
- Parallelizing Complex Scans and Reductions: a method for automatically extracting parallel prefix programs from sequential loops, even in the presence of complicated conditional statements (see the scan sketch after this list).
- Parallelizing non-linear sequential models over the sequence length: https://x.com/christopher/status/1811406837675163998?s=46
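The common trick behind these scan-based approaches, sketched below in plain NumPy under simplifying assumptions (a first-order linear recurrence, no conditionals): the sequential loop h_t = a_t * h_{t-1} + b_t is recast as a prefix scan over (a, b) pairs with an associative composition operator, which Hillis-Steele or Blelloch style scans evaluate in O(log T) parallel steps.

```python
import numpy as np

def sequential(a, b):
    # the loop we want to parallelize: h_t = a_t * h_{t-1} + b_t, with h_{-1} = 0
    h, out = 0.0, []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return np.array(out)

def scan(a, b):
    # Hillis-Steele inclusive scan over the affine maps x -> a*x + b, using the
    # associative composition (a1, b1) then (a2, b2) = (a1*a2, a2*b1 + b2).
    # Each doubling step is purely element-wise, hence parallelizable.
    a, b = a.astype(float).copy(), b.astype(float).copy()
    step = 1
    while step < len(a):
        a_prev = np.concatenate([np.ones(step), a[:-step]])   # pad with the identity map
        b_prev = np.concatenate([np.zeros(step), b[:-step]])
        a, b = a_prev * a, a * b_prev + b
        step *= 2
    return b   # b_t is the composed prefix map applied to h_{-1} = 0

rng = np.random.default_rng(0)
a, b = rng.standard_normal(8), rng.standard_normal(8)
assert np.allclose(sequential(a, b), scan(a, b))
```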
nvmath-python
- nvmath-python is a new NVIDIA library that provides Pythonic access to accelerated math operations from CUDA-X Math. It is currently in beta and includes operations for linear algebra and fast Fourier transforms; a hedged usage sketch follows.
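A hedged usage sketch: the module paths below (`nvmath.linalg.advanced.matmul`, `nvmath.fft.fft`) follow the beta documentation and may change, so treat them as assumptions rather than a stable API. nvmath-python operates directly on GPU arrays such as CuPy arrays or PyTorch tensors.

```python
import cupy as cp
import nvmath  # beta; API subject to change

a = cp.random.rand(2048, 2048, dtype=cp.float32)
b = cp.random.rand(2048, 2048, dtype=cp.float32)

# cuBLASLt-backed matrix multiply (stateless function API)
c = nvmath.linalg.advanced.matmul(a, b)

# cuFFT-backed FFT; fft expects a complex-typed input array
x = cp.random.rand(1 << 20).astype(cp.complex64)
y = nvmath.fft.fft(x)
```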
Sequence Parallelism - Long context
- Linear Attention Sequence Parallelism (LASP): scales sequence length up to 4096K tokens on 1B-parameter models using 128 A100 80GB GPUs, 8x longer than existing SP methods while being significantly faster.
- RingAttention (a single-process sketch of the ring pattern follows this list)
- StripedAttention
- BurstAttention
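For intuition about the ring-style methods above, here is a minimal single-process NumPy sketch (illustrative only, not any paper's reference implementation): the sequence is split into blocks, one per device; queries stay local while the K/V blocks rotate around a ring, and a numerically stable online-softmax accumulation means the full attention matrix is never materialized. Real implementations overlap the point-to-point communication with compute and add causal masking.

```python
import numpy as np

def ring_attention(q_blocks, k_blocks, v_blocks):
    n_dev = len(q_blocks)
    d = q_blocks[0].shape[-1]
    # per-"device" running output, row max, and row sum for the online softmax
    out = [np.zeros_like(q) for q in q_blocks]
    row_max = [np.full(q.shape[0], -np.inf) for q in q_blocks]
    row_sum = [np.zeros(q.shape[0]) for q in q_blocks]

    k_cur, v_cur = list(k_blocks), list(v_blocks)
    for _ in range(n_dev):                        # n_dev ring steps
        for i in range(n_dev):                    # each "device" works in parallel
            s = q_blocks[i] @ k_cur[i].T / np.sqrt(d)       # local score block
            new_max = np.maximum(row_max[i], s.max(axis=-1))
            scale = np.exp(row_max[i] - new_max)            # rescale old accumulators
            p = np.exp(s - new_max[:, None])
            out[i] = out[i] * scale[:, None] + p @ v_cur[i]
            row_sum[i] = row_sum[i] * scale + p.sum(axis=-1)
            row_max[i] = new_max
        # rotate the K/V blocks one hop around the ring (stands in for P2P sends)
        k_cur = k_cur[1:] + k_cur[:1]
        v_cur = v_cur[1:] + v_cur[:1]
    return [o / l[:, None] for o, l in zip(out, row_sum)]

# sanity check against dense attention
rng = np.random.default_rng(0)
n_dev, blk, d = 4, 8, 16
q, k, v = (rng.standard_normal((n_dev * blk, d)) for _ in range(3))
s = q @ k.T / np.sqrt(d)
ref = (np.exp(s - s.max(-1, keepdims=True)) /
       np.exp(s - s.max(-1, keepdims=True)).sum(-1, keepdims=True)) @ v
out = np.concatenate(ring_attention(np.split(q, n_dev), np.split(k, n_dev), np.split(v, n_dev)))
assert np.allclose(out, ref, atol=1e-6)
```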