Engineering
General
- Talk from Modal Labs CTO about data infrastructure, Docker, … https://www.youtube.com/watch?v=3jJ1GhGkLY0
- https://wizardzines.com/ (comics explaining various programming topics)
ML
- torch.compile, the missing manual: https://docs.google.com/document/d/1y5CRfMLdwEoF1nTk9q8qEu1mgMUuUtvhklPKJ2emLU8/edit#heading=h.ivdr7fmrbeab
- Universal Checkpointing: https://x.com/stasbekman/status/1808287880781127930?s=12
- Dirty PoC on how to start multiple vLLM OpenAI servers with n-way data parallelism and m-way tensor parallelism on n*m GPUs for maximum generation throughput (see the sketch at the end of this section): https://x.com/_philschmid/status/1807518758736728451?s=12
- Unsurprisingly, lots of papers at this year’s USENIX on AI/LLM training and inference infra. USENIX/OSDI are pure systems & infrastructure conferences, less theoretical than NeurIPS/ICLR/ICML/etc., so they should be fun. https://t.co/CQpd11NQPx
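A rough sketch of the idea behind the vLLM data-parallel PoC above (not the script from the tweet): launch n independent OpenAI-compatible servers, each pinned to its own group of m GPUs via CUDA_VISIBLE_DEVICES and given --tensor-parallel-size m, then spread requests across the ports yourself. Model name and GPU counts below are placeholders.

```python
import os
import subprocess

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
DP = 4   # number of independent server replicas (data parallelism)
TP = 2   # GPUs per replica (tensor parallelism), so DP * TP GPUs total

procs = []
for rank in range(DP):
    # give each replica its own slice of GPUs and its own port
    gpus = ",".join(str(rank * TP + i) for i in range(TP))
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": gpus}
    cmd = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", MODEL,
        "--tensor-parallel-size", str(TP),
        "--port", str(8000 + rank),
    ]
    procs.append(subprocess.Popen(cmd, env=env))

# requests then need to be load-balanced across ports 8000 .. 8000+DP-1
for p in procs:
    p.wait()
```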
Infra profiling
- Machine Learning Engineering Open Book: https://github.com/stas00/ml-engineering
- PyTorch benchmark recipe (see the Timer sketch at the end of this section): https://pytorch.org/tutorials/recipes/recipes/benchmark.html
- Nsight Compute for kernel profiling
- Nsight Compute Profiling Guide
- mcarilli/nsight.sh - Favorite nsight systems profiling commands for PyTorch scripts
- Profiling GPU Applications with Nsight Systems
- There are several formulas for computing MFU & HFU, which are more realistic than nvidia-smi utilization and also tell you how much performance you can still squeeze out of the GPUs. The formula from the PaLM paper is the reference most projects use (Megatron, nanoGPT, …). I’m using the implementation in https://github.com/pytorch/torchtitan/blob/b0ed7f075921357b01e28fddc6d90a2cc410bab3/torchtitan/utils.py#L123; see https://github.com/pytorch/torchtitan/blob/b0ed7f075921357b01e28fddc6d90a2cc410bab3/train.py#L224 and https://github.com/pytorch/torchtitan/blob/b0ed7f075921357b01e28fddc6d90a2cc410bab3/train.py#L434 for how it is wired into the training loop. The discussion in https://github.com/pytorch/torchtitan/pull/280 is interesting too. A worked sketch of the PaLM formula is at the end of this section.
- Benchmarking LLM inference backends (vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and TGI): https://www.bentoml.com/blog/benchmarking-llm-inference-backends
- py-spy, a Python sampling profiler with extremely low overhead; it needs no code modification and can be attached to live production code. Just run:
py-spy top --pid <pid>
- Stress-testing an inference stack: https://x.com/stasbekman/status/1844924617980510675?s=46
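For the PyTorch benchmark recipe above, the core API is torch.utils.benchmark.Timer, which takes care of warmup, CUDA synchronization, and statistics for you. A minimal sketch (the shape is arbitrary):

```python
import torch
from torch.utils import benchmark

x = torch.randn(4096, 4096, device="cuda" if torch.cuda.is_available() else "cpu")

# Timer handles CUDA sync and warmup, unlike a naive time.time() loop
t = benchmark.Timer(
    stmt="x @ x",
    globals={"x": x},
    label="matmul",
    description="4096x4096 fp32",
)
print(t.timeit(50))                         # run the statement 50 times, print timing stats
print(t.blocked_autorange(min_run_time=1.0))  # or let it pick the number of runs adaptively
```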
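And for the MFU bullet above, a minimal sketch of the PaLM-style MFU calculation (roughly the formula torchtitan implements; all model-shape and throughput numbers below are made-up examples, and peak FLOP/s depends on your GPU and dtype). HFU would additionally count recomputation from activation checkpointing.

```python
def model_flops_per_token(num_params, num_layers, num_heads, head_dim, seq_len):
    # PaLM Appendix B: 6*N for the dense matmuls (fwd + bwd)
    # plus the attention term 12 * L * H * Q * T
    return 6 * num_params + 12 * num_layers * num_heads * head_dim * seq_len

# hypothetical 7B-class model, for illustration only
flops_per_token = model_flops_per_token(
    num_params=7e9, num_layers=32, num_heads=32, head_dim=128, seq_len=4096
)

tokens_per_sec = 400_000        # measured training throughput (example number)
num_gpus = 64
peak_flops_per_gpu = 989e12     # e.g. H100 SXM BF16 dense peak, no sparsity

mfu = tokens_per_sec * flops_per_token / (num_gpus * peak_flops_per_gpu)
print(f"MFU: {mfu:.1%}")
```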
GPU concepts
- Tim Dettmers’ explanation of how GPUs work (memory bandwidth, etc.): https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/#Memory_Bandwidth
- Understanding NVIDIA GPU Performance: Utilization vs. Saturation (2023)
GPU programming
Beginner
- Getting Started With CUDA for Python Programmers (Jeremy Howard)
- Triton programming for Mamba: https://srush.github.io/annotated-mamba/hard.html
- Colfax tutorials (from the FA3 authors): https://research.colfax-intl.com
Advanced
- Walkthrough of the Parallel Scan with CUDA
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
- Flash Attention implementation that’s just 100 lines of code and 30% faster
- ThunderKittens (TK), a simple DSL embedded within CUDA that makes it easy to express key technical ideas for building AI kernels. TK lets us write clean, easy-to-understand code that maximizes GPU utilization — on all kinds of kernels!
- Kernel for fast 2:4 sparsification (50% zeros), an order of magnitude faster than alternatives. When sparsifying the weights, it makes linear layers 30% faster when counting the forward + backward passes (see the 2:4 sketch after this list).
- Fused RMSNorm Triton kernel (a plain reference implementation of what it computes is sketched after this list): https://github.com/pytorch-labs/applied-ai/blob/main/kernels/triton/training/rms_norm/fused_rms_norm.py
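For reference, here is what the fused RMSNorm kernel above computes, as a plain eager PyTorch function. This is not the Triton code, just a correctness baseline to compare a fused kernel against:

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # normalize by the root-mean-square over the last dim, then apply a learned scale
    rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x * rms * weight

x = torch.randn(2, 16, 512)
w = torch.ones(512)
y = rms_norm(x, w)
```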
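Separately, a sketch of how 2:4 semi-structured sparsity is consumed on the PyTorch side. This is not the sparsification kernel from the bullet above; it just shows crude magnitude-based 2:4 pruning plus torch.sparse.to_sparse_semi_structured (available in recent PyTorch), and assumes an Ampere-or-newer GPU:

```python
import torch
from torch.sparse import to_sparse_semi_structured

lin = torch.nn.Linear(4096, 4096, bias=False).half().cuda()

# crude magnitude-based 2:4 pruning: keep the 2 largest entries in every group of 4
w = lin.weight.detach()
groups = w.view(-1, 4)
keep = groups.abs().topk(2, dim=-1).indices
mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, keep, True)
w_24 = (groups * mask).view_as(w)

# compress to the semi-structured format; linear layers then use sparse tensor cores
lin.weight = torch.nn.Parameter(to_sparse_semi_structured(w_24))

x = torch.randn(128, 4096, device="cuda", dtype=torch.half)
y = lin(x)
```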
Parallel programming (scans, …)
- Parallelizing Complex Scans and Reductions
- A method for automatically extracting parallel prefix programs from sequential loops, even in the presence of complicated conditional statements.
- Parallelizing non-linear sequential models over the sequence length (the associative-scan trick these methods build on is sketched at the end of this list)
- https://x.com/christopher/status/1811406837675163998?s=46
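The core trick these scan papers build on, in its simplest form: a first-order linear recurrence h_t = a_t * h_{t-1} + b_t becomes associative once each step is treated as an affine map, so it can be evaluated in O(log T) parallel steps instead of a sequential loop. A minimal PyTorch sketch (Hillis-Steele style, simple but not work-efficient):

```python
import torch

def linear_recurrence_scan(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Parallel inclusive scan for h_t = a_t * h_{t-1} + b_t with h_0 = 0, time on dim 0.

    Each step is the affine map h -> a_t * h + b_t; composing (a1, b1) then (a2, b2)
    gives (a1 * a2, a2 * b1 + b2), which is associative, so a log-depth scan
    reproduces the sequential recurrence.
    """
    a, b = a.clone(), b.clone()
    T = a.shape[0]
    offset = 1
    while offset < T:
        # combine each position t with position t - offset, all positions at once
        a_new = a[offset:] * a[:-offset]
        b_new = a[offset:] * b[:-offset] + b[offset:]
        a = torch.cat([a[:offset], a_new], dim=0)
        b = torch.cat([b[:offset], b_new], dim=0)
        offset *= 2
    return b  # after the scan, b_t holds h_t

def linear_recurrence_ref(a, b):
    # sequential reference for checking the result
    h, out = torch.zeros_like(b[0]), []
    for t in range(a.shape[0]):
        h = a[t] * h + b[t]
        out.append(h)
    return torch.stack(out)

a, b = torch.rand(16, 8), torch.randn(16, 8)
assert torch.allclose(linear_recurrence_scan(a, b), linear_recurrence_ref(a, b), atol=1e-5)
```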
nvmath-python is a new NVIDIA library that enables Pythonic access to accelerated math ops from CUDA-X Math. It’s currently in beta and includes ops for linear algebra and fast Fourier transforms.
Sequence Parallelism - Long context
- Linear Attention Sequence Parallelism
- LASP scales sequence length up to 4096K using 128 A100 80G GPUs on 1B models, which is 8 times longer than existing SP methods while being significantly faster.
- RingAttention
- StripedAttention
- BurstAttention
Training
- Megatron blog post (scatter-gather optimization, performance microbenchmarks for pipeline parallelism, …)
- GSPMD: General and Scalable Parallelization for ML Computation Graphs
- GSPMD is now the fundamental component of JAX/TensorFlow distributed training and enables various optimizations with the XLA compiler to allow users to train their models efficiently in a large scale setting.
Inference
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
- Nice blog post on LLM inference optimizations: https://vgel.me/posts/faster-inference/
- Cut LLM costs by mixing GPU types
- Tim Dettmers on quantization: https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/
- SliceGPT: Weight Matrix Compression for LLMs (https://huggingface.co/papers/2401.15024)
- vLLM FP8 support: https://x.com/anyscalecompute/status/1811059148911693906?s=46
- vLLM office hours / videos: https://neuralmagic.com/community-office-hours/
- SGLang (used by the xAI team for Grok-mini): https://github.com/sgl-project/sglang
- You can build a custom TorchDynamo backend for highly efficient inference; a minimal example of the backend mechanism is sketched below.
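A minimal sketch of the custom-backend mechanism: torch.compile hands your backend the captured FX graph plus example inputs, and whatever callable you return is what gets executed. This toy backend just prints the graph and falls back to eager execution; a real inference backend would return an optimized callable instead.

```python
import torch

def my_backend(gm: torch.fx.GraphModule, example_inputs):
    # inspect the captured graph, then return a callable to run for it
    gm.graph.print_tabular()
    return gm.forward  # identity "compiler": run the graph eagerly

@torch.compile(backend=my_backend)
def f(x):
    return torch.sin(x) + torch.cos(x)

f(torch.randn(8))
```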
Low precision
- Google’s report on Gemma2
- Initial takeaway: many tricks focus on training stability and are particularly suitable for low-precision scenarios, e.g. logit soft-capping (see the snippet after this list) and sandwich layer normalization. Does this hint at int8 training being crucial?
- Effective Interplay between Sparsity and Quantization: From Theory to Practice
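For reference, logit soft-capping as described for Gemma 2 is just a tanh squashing that smoothly bounds values to (-cap, +cap); the report gives 50.0 for attention logits and 30.0 for the final logits.

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # smoothly bounds values to (-cap, cap) while staying roughly linear near 0
    return cap * torch.tanh(logits / cap)

attn_scores = soft_cap(torch.randn(4, 4) * 100, cap=50.0)    # attention logits
final_logits = soft_cap(torch.randn(4, 256000), cap=30.0)    # output logits
```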
Inference
- From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
- Seems potentially useful for compressing the kv cache or developing alternative methods: https://arxiv.org/abs/2404.15574
Scaling
- Complete overview of MuP
- Simo Ryu’s guide to scaling up from a small-scale proxy: https://cloneofsimo.notion.site/What-to-do-to-scale-up-09e469d7c3444d6a90305397c38a46f5
- Our 12 scaling laws (for LLM knowledge capacity) are out: https://arxiv.org/abs/2404.05405. Took me 4mos to submit 50,000 jobs; took Meta 1mo for legal review; FAIR sponsored 4,200,000 GPU hrs. Hope this is a new direction to study scaling laws + help practitioners make informed decisions
- MiniCPM: Unveiling the Potential of End-side Large Language Models
- DeepSeek LLM report (section 3.1)
- Mechanistic Design and Scaling of Hybrid Architectures
- Scaling Laws for Linear Complexity Language Models
- Compute Better Spent: Replacing Dense Layers with Structured Matrices
- An Empirical Study of Mamba-based Language Models
- Scaling laws with structured layers https://arxiv.org/pdf/2410.02117
Research
- Simo Ryu, list of research ideas: https://x.com/cloneofsimo/status/1807461666957013120
- Simo Ryu, list of insightful NN papers: https://github.com/cloneofsimo/insightful-nn-papers
- GiffMana thread on Distillation
- TTT (test-time training)
- Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models https://arxiv.org/abs/2410.11081
- ViT improvements in tokenization https://x.com/wenhaoli29/status/1846217454059389410?s=46