Engineering
General
- Talk from Modal Labs CTO about data infrastructure, Docker, … https://www.youtube.com/watch?v=3jJ1GhGkLY0
- https://wizardzines.com/ (comics explaining various programming topics)
ML
- torch.compile, the missing manual: https://docs.google.com/document/d/1y5CRfMLdwEoF1nTk9q8qEu1mgMUuUtvhklPKJ2emLU8/edit#heading=h.ivdr7fmrbeab
- Universal Checkpointing: https://x.com/stasbekman/status/1808287880781127930?s=12
- Dirty PoC on how to start multiple vLLM OpenAI servers with n-way data parallelism and m-way tensor parallelism on n*m GPUs for maximum generation throughput (see the sketch at the end of this section): https://x.com/_philschmid/status/1807518758736728451?s=12
- Unsurprisingly, lots of papers at this year’s USENIX on AI/LLM training and inference infra. USENIX/OSDI are pure systems & infrastructure conferences, less theoretical than NeurIPS/ICLR/ICML/etc., so they should be fun. https://t.co/CQpd11NQPx
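A rough sketch of the idea behind the vLLM data-parallel PoC above (not the script from the tweet): launch n independent OpenAI-compatible servers, each pinned to its own group of m GPUs via CUDA_VISIBLE_DEVICES and given --tensor-parallel-size m, then spread requests across the ports yourself. Model name and GPU counts below are placeholders.

```python
import os
import subprocess

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
DP = 4   # number of independent server replicas (data parallelism)
TP = 2   # GPUs per replica (tensor parallelism), so DP * TP GPUs total

procs = []
for rank in range(DP):
    # give each replica its own slice of GPUs and its own port
    gpus = ",".join(str(rank * TP + i) for i in range(TP))
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": gpus}
    cmd = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", MODEL,
        "--tensor-parallel-size", str(TP),
        "--port", str(8000 + rank),
    ]
    procs.append(subprocess.Popen(cmd, env=env))

# requests then need to be load-balanced across ports 8000 .. 8000+DP-1
for p in procs:
    p.wait()
```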
Infra profiling
- Machine Learning Engineering Open Book: https://github.com/stas00/ml-engineering
- PyTorch benchmark recipe (see the Timer sketch at the end of this section): https://pytorch.org/tutorials/recipes/recipes/benchmark.html
- Nsight Compute for kernel profiling
- Nsight Compute Profiling Guide
- mcarilli/nsight.sh - Favorite nsight systems profiling commands for PyTorch scripts
- Profiling GPU Applications with Nsight Systems
- There are several formulas for computing MFU & HFU, which are more realistic than nvidia-smi utilization and also tell you how much performance you can still squeeze out of the GPUs. The formula from the PaLM paper is the reference most projects use (Megatron, nanoGPT, …). I’m using the implementation in https://github.com/pytorch/torchtitan/blob/b0ed7f075921357b01e28fddc6d90a2cc410bab3/torchtitan/utils.py#L123; see https://github.com/pytorch/torchtitan/blob/b0ed7f075921357b01e28fddc6d90a2cc410bab3/train.py#L224 and https://github.com/pytorch/torchtitan/blob/b0ed7f075921357b01e28fddc6d90a2cc410bab3/train.py#L434 for how it is wired into the training loop. The discussion in https://github.com/pytorch/torchtitan/pull/280 is interesting too. A worked sketch of the PaLM formula is at the end of this section.
- Benchmarking LLM inference backends (vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and TGI): https://www.bentoml.com/blog/benchmarking-llm-inference-backends
- py-spy, a Python sampling profiler with extremely low overhead; it needs no code modification and can be attached to live production code. Just run:
py-spy top --pid <pid>
- Stress-testing an inference stack: https://x.com/stasbekman/status/1844924617980510675?s=46
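For the PyTorch benchmark recipe above, the core API is torch.utils.benchmark.Timer, which takes care of warmup, CUDA synchronization, and statistics for you. A minimal sketch (the shape is arbitrary):

```python
import torch
from torch.utils import benchmark

x = torch.randn(4096, 4096, device="cuda" if torch.cuda.is_available() else "cpu")

# Timer handles CUDA sync and warmup, unlike a naive time.time() loop
t = benchmark.Timer(
    stmt="x @ x",
    globals={"x": x},
    label="matmul",
    description="4096x4096 fp32",
)
print(t.timeit(50))                         # run the statement 50 times, print timing stats
print(t.blocked_autorange(min_run_time=1.0))  # or let it pick the number of runs adaptively
```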
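And for the MFU bullet above, a minimal sketch of the PaLM-style MFU calculation (roughly the formula torchtitan implements; all model-shape and throughput numbers below are made-up examples, and peak FLOP/s depends on your GPU and dtype). HFU would additionally count recomputation from activation checkpointing.

```python
def model_flops_per_token(num_params, num_layers, num_heads, head_dim, seq_len):
    # PaLM Appendix B: 6*N for the dense matmuls (fwd + bwd)
    # plus the attention term 12 * L * H * Q * T
    return 6 * num_params + 12 * num_layers * num_heads * head_dim * seq_len

# hypothetical 7B-class model, for illustration only
flops_per_token = model_flops_per_token(
    num_params=7e9, num_layers=32, num_heads=32, head_dim=128, seq_len=4096
)

tokens_per_sec = 400_000        # measured training throughput (example number)
num_gpus = 64
peak_flops_per_gpu = 989e12     # e.g. H100 SXM BF16 dense peak, no sparsity

mfu = tokens_per_sec * flops_per_token / (num_gpus * peak_flops_per_gpu)
print(f"MFU: {mfu:.1%}")
```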
GPU concepts
- Tim Dettmers’ explanation of how GPUs work (memory bandwidth, etc.): https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/#Memory_Bandwidth
- Understanding NVIDIA GPU Performance: Utilization vs. Saturation (2023)
GPU programming
Beginner
- Getting Started With CUDA for Python Programmers (Jeremy Howard)
- Triton programming for Mamba: https://srush.github.io/annotated-mamba/hard.html
- Colfax tutorials (from the FA3 authors): https://research.colfax-intl.com
Advanced
- Walkthrough of the Parallel Scan with CUDA
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
- Flash Attention implementation that’s just 100 lines of code and 30% faster
- ThunderKittens (TK), a simple DSL embedded within CUDA that makes it easy to express key technical ideas for building AI kernels. TK lets us write clean, easy-to-understand code that maximizes GPU utilization — on all kinds of kernels!
- Kernel for fast 2:4 sparsification (50% zeros), an order of magnitude faster than alternatives. When sparsifying the weights, it makes linear layers 30% faster when counting the forward + backward passes (see the 2:4 sketch after this list).
- Fused RMSNorm Triton kernel (a plain reference implementation of what it computes is sketched after this list): https://github.com/pytorch-labs/applied-ai/blob/main/kernels/triton/training/rms_norm/fused_rms_norm.py
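For reference, here is what the fused RMSNorm kernel above computes, as a plain eager PyTorch function. This is not the Triton code, just a correctness baseline to compare a fused kernel against:

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # normalize by the root-mean-square over the last dim, then apply a learned scale
    rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x * rms * weight

x = torch.randn(2, 16, 512)
w = torch.ones(512)
y = rms_norm(x, w)
```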
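Separately, a sketch of how 2:4 semi-structured sparsity is consumed on the PyTorch side. This is not the sparsification kernel from the bullet above; it just shows crude magnitude-based 2:4 pruning plus torch.sparse.to_sparse_semi_structured (available in recent PyTorch), and assumes an Ampere-or-newer GPU:

```python
import torch
from torch.sparse import to_sparse_semi_structured

lin = torch.nn.Linear(4096, 4096, bias=False).half().cuda()

# crude magnitude-based 2:4 pruning: keep the 2 largest entries in every group of 4
w = lin.weight.detach()
groups = w.view(-1, 4)
keep = groups.abs().topk(2, dim=-1).indices
mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, keep, True)
w_24 = (groups * mask).view_as(w)

# compress to the semi-structured format; linear layers then use sparse tensor cores
lin.weight = torch.nn.Parameter(to_sparse_semi_structured(w_24))

x = torch.randn(128, 4096, device="cuda", dtype=torch.half)
y = lin(x)
```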
Parallel programming (scans, …)
- Parallelizing Complex Scans and Reductions
- A method for automatically extracting parallel prefix programs from sequential loops, even in the presence of complicated conditional statements.
- Parallelizing non-linear sequential models over the sequence length (the associative-scan trick these methods build on is sketched at the end of this list)
- https://x.com/christopher/status/1811406837675163998?s=46
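The core trick these scan papers build on, in its simplest form: a first-order linear recurrence h_t = a_t * h_{t-1} + b_t becomes associative once each step is treated as an affine map, so it can be evaluated in O(log T) parallel steps instead of a sequential loop. A minimal PyTorch sketch (Hillis-Steele style, simple but not work-efficient):

```python
import torch

def linear_recurrence_scan(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Parallel inclusive scan for h_t = a_t * h_{t-1} + b_t with h_0 = 0, time on dim 0.

    Each step is the affine map h -> a_t * h + b_t; composing (a1, b1) then (a2, b2)
    gives (a1 * a2, a2 * b1 + b2), which is associative, so a log-depth scan
    reproduces the sequential recurrence.
    """
    a, b = a.clone(), b.clone()
    T = a.shape[0]
    offset = 1
    while offset < T:
        # combine each position t with position t - offset, all positions at once
        a_new = a[offset:] * a[:-offset]
        b_new = a[offset:] * b[:-offset] + b[offset:]
        a = torch.cat([a[:offset], a_new], dim=0)
        b = torch.cat([b[:offset], b_new], dim=0)
        offset *= 2
    return b  # after the scan, b_t holds h_t

def linear_recurrence_ref(a, b):
    # sequential reference for checking the result
    h, out = torch.zeros_like(b[0]), []
    for t in range(a.shape[0]):
        h = a[t] * h + b[t]
        out.append(h)
    return torch.stack(out)

a, b = torch.rand(16, 8), torch.randn(16, 8)
assert torch.allclose(linear_recurrence_scan(a, b), linear_recurrence_ref(a, b), atol=1e-5)
```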
nvmath-python is a new NVIDIA library that enables Pythonic access to accelerated math ops from CUDA-X Math. It’s currently in beta and includes ops for linear algebra and fast Fourier transforms.
Sequence Parallelism - Long context
- Linear Attention Sequence Parallelism
- LASP scales sequence length up to 4096K using 128 A100 80G GPUs on 1B models, which is 8 times longer than existing SP methods while being significantly faster.
- RingAttention
- StripedAttention
- BurstAttention
Training
- Megatron blog post (scatter-gather optimization, performance microbenchmarks for pipeline parallelism, …)
- GSPMD: General and Scalable Parallelization for ML Computation Graphs
- GSPMD is now the fundamental component of JAX/TensorFlow distributed training and enables various optimizations with the XLA compiler to allow users to train their models efficiently in a large scale setting.
Inference
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
- Nice blog post on LLM inference optimizations: https://vgel.me/posts/faster-inference/
- Cut LLM costs by mixing GPU types
- Tim Dettmers on quantization: https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/
- SliceGPT: Weight Matrix Compression for LLMs (https://huggingface.co/papers/2401.15024)
- vLLM FP8 support: https://x.com/anyscalecompute/status/1811059148911693906?s=46
- vLLM office hours / videos: https://neuralmagic.com/community-office-hours/
- SGLang (used by the xAI team for Grok-mini): https://github.com/sgl-project/sglang
- You can build a custom TorchDynamo backend for highly efficient inference; a minimal example of the backend mechanism is sketched below.
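A minimal sketch of the custom-backend mechanism: torch.compile hands your backend the captured FX graph plus example inputs, and whatever callable you return is what gets executed. This toy backend just prints the graph and falls back to eager execution; a real inference backend would return an optimized callable instead.

```python
import torch

def my_backend(gm: torch.fx.GraphModule, example_inputs):
    # inspect the captured graph, then return a callable to run for it
    gm.graph.print_tabular()
    return gm.forward  # identity "compiler": run the graph eagerly

@torch.compile(backend=my_backend)
def f(x):
    return torch.sin(x) + torch.cos(x)

f(torch.randn(8))
```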
Low precision
- Google’s report on Gemma2
- Initial takeaway: many tricks focus on training stability and are particularly suitable for low-precision scenarios, e.g. logit soft-capping (see the snippet after this list) and sandwich layer normalization. Does this hint at int8 training being crucial?
- Effective Interplay between Sparsity and Quantization: From Theory to Practice
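For reference, logit soft-capping as described for Gemma 2 is just a tanh squashing that smoothly bounds values to (-cap, +cap); the report gives 50.0 for attention logits and 30.0 for the final logits.

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # smoothly bounds values to (-cap, cap) while staying roughly linear near 0
    return cap * torch.tanh(logits / cap)

attn_scores = soft_cap(torch.randn(4, 4) * 100, cap=50.0)    # attention logits
final_logits = soft_cap(torch.randn(4, 256000), cap=30.0)    # output logits
```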
Inference
- From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
- Seems potentially useful for compressing the kv cache or developing alternative methods: https://arxiv.org/abs/2404.15574
Scaling
- Complete overview of MuP
- Simo Ryu’s guide to scaling up from a small-scale proxy: https://cloneofsimo.notion.site/What-to-do-to-scale-up-09e469d7c3444d6a90305397c38a46f5
- Our 12 scaling laws (for LLM knowledge capacity) are out: https://arxiv.org/abs/2404.05405. Took me 4mos to submit 50,000 jobs; took Meta 1mo for legal review; FAIR sponsored 4,200,000 GPU hrs. Hope this is a new direction to study scaling laws + help practitioners make informed decisions
- MiniCPM: Unveiling the Potential of End-side Large Language Models
- DeepSeek LLM report (section 3.1)
- Mechanistic Design and Scaling of Hybrid Architectures
- Scaling Laws for Linear Complexity Language Models
- Compute Better Spent: Replacing Dense Layers with Structured Matrices
- An Empirical Study of Mamba-based Language Models
- Scaling laws with structured layers https://arxiv.org/pdf/2410.02117
Research
- Simo Ryu, list of research ideas: https://x.com/cloneofsimo/status/1807461666957013120
- Simo Ryu, list of insightful NN papers: https://github.com/cloneofsimo/insightful-nn-papers
- GiffMana thread on Distillation
- TTT (test-time training)
- Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models https://arxiv.org/abs/2410.11081
- ViT improvements in tokenization https://x.com/wenhaoli29/status/1846217454059389410?s=46