Engineering
General

Talk from Modal Labs CTO about data infrastructure, Docker, … https://www.youtube.com/watch?v=3jJ1GhGkLY0

https://wizardzines.com/ (comic books for different programming stuff)
ML

torch.compile, the missing manual, https://docs.google.com/document/d/1y5CRfMLdwEoF1nTk9q8qEu1mgMUuUtvhklPKJ2emLU8/edit#heading=h.ivdr7fmrbeab

Universal Checkpointing https://x.com/stasbekman/status/1808287880781127930?s=12

Dirty PoC on how to start multiple vLLM OpenAI-compatible servers with n-way DP and m-way TP on n*m GPUs for maximum generation throughput (sketch below). https://x.com/_philschmid/status/1807518758736728451?s=12
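A minimal sketch of the same idea (model name, ports, and GPU counts are illustrative): one vLLM OpenAI-compatible server per data-parallel replica, each pinned to its own GPU subset via CUDA_VISIBLE_DEVICES, with a load balancer in front.

```python
import os
import subprocess

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative model name
n_dp, m_tp = 2, 2  # 2 replicas x 2-way tensor parallel = 4 GPUs total

procs = []
for rank in range(n_dp):
    # Pin each replica to its own disjoint set of m_tp GPUs and its own port.
    gpus = ",".join(str(rank * m_tp + g) for g in range(m_tp))
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": gpus}
    procs.append(subprocess.Popen(
        ["python", "-m", "vllm.entrypoints.openai.api_server",
         "--model", MODEL,
         "--tensor-parallel-size", str(m_tp),
         "--port", str(8000 + rank)],
        env=env,
    ))

# A separate load balancer (nginx, a round-robin client, ...) spreads requests
# over ports 8000..8000+n_dp-1 to get the data-parallel throughput.
for p in procs:
    p.wait()
```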

Unsurprisingly, lots of papers at this year's USENIX re: AI/LLM training and inference infra. USENIX/OSDI are pure systems & infrastructure conferences, not as theoretical as NeurIPS/ICLR/ICML/etc., so it should be fun. https://t.co/CQpd11NQPx
Infra profiling

https://github.com/stas00/ml-engineering (Machine Learning Engineering Open Book)

PyTorch Benchmark https://pytorch.org/tutorials/recipes/recipes/benchmark.html
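The core of that recipe is torch.utils.benchmark.Timer, which handles the warmup and CUDA synchronization that naive time.time() loops get wrong (shapes below are illustrative):

```python
import torch
import torch.utils.benchmark as benchmark

x = torch.randn(4096, 4096, device="cuda")
y = torch.randn(4096, 4096, device="cuda")

# Timer synchronizes CUDA and does a warmup run before measuring.
t = benchmark.Timer(stmt="x @ y", globals={"x": x, "y": y})
print(t.timeit(100))  # run the statement 100 times and print the measurement
```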

Nsight Compute for kernel profiling
 Nsight Compute Profiling Guide
 mcarilli/nsight.sh  Favorite nsight systems profiling commands for PyTorch scripts
 Profiling GPU Applications with Nsight Systems

There are multiple formulas for computing MFU & HFU, which are more realistic than nvidia-smi utilization and also tell you how much performance you can still squeeze out of the GPUs. The reference most projects use (Megatron, nanoGPT, …) is the formula from the PaLM paper (sketch below). I'm using the implementation in https://github.com/pytorch/torchtitan/blob/b0ed7f075921357b01e28fddc6d90a2cc410bab3/torchtitan/utils.py#L123; check https://github.com/pytorch/torchtitan/blob/b0ed7f075921357b01e28fddc6d90a2cc410bab3/train.py#L224 and https://github.com/pytorch/torchtitan/blob/b0ed7f075921357b01e28fddc6d90a2cc410bab3/train.py#L434 too. The discussion in https://github.com/pytorch/torchtitan/pull/280 is also interesting.
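A minimal sketch of the PaLM-style MFU formula that the torchtitan code above implements (parameter names and the H100 peak number are illustrative; plug in your own GPU's peak BF16 FLOPS):

```python
def model_flops_utilization(tokens_per_sec: float, num_params: int, num_layers: int,
                            num_heads: int, head_dim: int, seq_len: int,
                            peak_flops_per_gpu: float = 989e12,  # H100 dense BF16
                            num_gpus: int = 1) -> float:
    # PaLM-style estimate: ~6*N FLOPs per token for the dense matmuls (fwd + bwd),
    # plus 12*L*H*Q*T FLOPs per token for attention over a sequence of length T.
    flops_per_token = 6 * num_params + 12 * num_layers * num_heads * head_dim * seq_len
    achieved = tokens_per_sec * flops_per_token
    return achieved / (peak_flops_per_gpu * num_gpus)
```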

Benchmarking LLM Inference Backends: vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and TGI https://www.bentoml.com/blog/benchmarking-llm-inference-backends

py-spy Python profiler: extremely low overhead, no code modification, can run against live production code; just run
py-spy top --pid <pid>
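It can also record a flamegraph of a running process without stopping it:
py-spy record -o profile.svg --pid <pid>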

stress test inference stack https://x.com/stasbekman/status/1844924617980510675?s=46
GPU concepts
 Explanations of GPUs https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/#Memory_Bandwidth
 Understanding NVIDIA GPU Performance: Utilization vs. Saturation (2023)
GPU programming
Beginner

Getting Started With CUDA for Python Programmers (jeremy howard)

Triton programming for Mamba https://srush.github.io/annotated-mamba/hard.html

Colfax tutorials (FA3 authors): https://research.colfax-intl.com
Advanced

Walkthrough of the Parallel Scan with Cuda

How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog

Flash Attention implementation that's just 100 lines of code and 30% faster
 ThunderKittens (TK), a simple DSL embedded within CUDA that makes it easy to express key technical ideas for building AI kernels. TK lets us write clean, easy-to-understand code that maximizes GPU utilization, on all kinds of kernels!

Kernel for fast 2:4 sparsification (50% of zeros), and it is an order of magnitude faster than alternatives. When sparsifying the weights, it makes linear layers 30% faster when considering the FW+BW passes
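For flavor, a hedged sketch of the same idea using PyTorch's prototype semi-structured (2:4) sparsity API rather than the kernel the post refers to; the pruning rule here is a naive magnitude-based 2-of-4 mask purely for illustration, and it assumes fp16/bf16 weights on an Ampere+ GPU:

```python
import torch
from torch import nn
from torch.sparse import to_sparse_semi_structured

linear = nn.Linear(4096, 4096, bias=False).half().cuda()

# Naive 2:4 mask: keep the 2 largest-magnitude weights in every group of 4.
w = linear.weight.detach()
groups = w.abs().reshape(-1, 4)
mask = torch.zeros_like(groups, dtype=torch.bool)
mask.scatter_(1, groups.topk(2, dim=1).indices, True)

# Store the pruned weight in the compressed 2:4 format so the linear layer
# dispatches to the sparse matmul kernels (and uses ~half the weight memory).
linear.weight = nn.Parameter(to_sparse_semi_structured(w * mask.reshape_as(w)))

x = torch.randn(128, 4096, dtype=torch.float16, device="cuda")
out = linear(x)
```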

Fused RMS Norm in Triton https://github.com/pytorch-labs/applied-ai/blob/main/kernels/triton/training/rms_norm/fused_rms_norm.py
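For context, the unfused reference that such a kernel replaces is only a few lines of PyTorch; the Triton version fuses the reduction, rsqrt, and scaling into one kernel to avoid extra trips through HBM:

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Normalize by the root mean square over the last dimension, then scale.
    rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x * rms * weight
```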
Parallel programming (scans, …)
 Parallelizing Complex Scans and Reductions
 method for automatically extracting parallel prefix programs from sequential loops, even in the presence of complicated conditional statements (a toy scan sketch follows this list).
 Parallelizing nonlinear sequential models over the sequence length
 https://x.com/christopher/status/1811406837675163998?s=46
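The toy scan sketch referenced above: a Hillis-Steele style inclusive scan that does O(log n) vectorized passes for any associative op, which is the basic pattern these papers generalize (numpy stands in for actual parallel hardware):

```python
import numpy as np

def inclusive_scan(x: np.ndarray, op=np.add) -> np.ndarray:
    # Each pass combines every element with the one `offset` positions to its
    # left; after log2(n) passes every prefix has been accumulated.
    x = x.copy()
    offset = 1
    while offset < len(x):
        x[offset:] = op(x[offset:], x[:-offset])
        offset *= 2
    return x

assert np.array_equal(inclusive_scan(np.arange(1, 9)), np.cumsum(np.arange(1, 9)))
```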
nvmath-python
is a new @nvidia library that enables pythonic access to accelerated math ops from CUDA-X Math. It's currently in beta and includes ops for linear algebra and fast Fourier transforms.
Sequence Parallelism / Long context
 Linear Attention Sequence Parallelism
 LASP scales sequence length up to 4096K using 128 A100 80G GPUs on 1B models, which is 8 times longer than existing SP methods while being significantly faster.
 Ring Attention
 Striped Attention
 BurstAttention
Training

Megatron blog post (scatter-gather optimization, performance microbenchmarks for pipeline parallelism, …)

GSPMD: General and Scalable Parallelization for ML Computation Graphs
 GSPMD is now the fundamental component of JAX/TensorFlow distributed training and enables various optimizations with the XLA compiler to allow users to train their models efficiently in a large scale setting.
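A minimal JAX sketch of the user-facing side of this (axis names and shapes are illustrative, and the batch size is assumed to divide the device count): you annotate a few arrays with shardings over a device mesh, and the XLA/GSPMD partitioner propagates the partitioning through the jitted computation.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# 1D device mesh with a single "data" axis.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

# Shard the batch dimension across "data"; replicate the weight.
x = jax.device_put(jnp.ones((64, 512)), NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((512, 128)), NamedSharding(mesh, P(None, None)))

@jax.jit
def forward(x, w):
    # GSPMD decides how to partition this matmul from the operand shardings.
    return jnp.tanh(x @ w)

print(forward(x, w).sharding)  # output stays sharded along the batch axis
```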
Inference

Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

Nice blog on LLM inference optimizations: https://vgel.me/posts/faster-inference/

Cut LLM costs by mixing GPU types

Tim Dettmers on quantization https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/

SliceGPT: Weight Matrix Compression for LLMs https://huggingface.co/papers/2401.15024

vLLM FP8 support https://x.com/anyscalecompute/status/1811059148911693906?s=46

vLLM office hours / videos https://neuralmagic.com/community-office-hours/

SGLang (used by xAI people for Grok-mini) https://github.com/sgl-project/sglang

You can build a custom TorchDynamo backend for super-efficient inference (minimal sketch below)
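A minimal sketch of what that looks like: the backend receives the captured FX graph plus example inputs and can hand it to whatever compiler or runtime you like. Here it just prints the graph and falls back to eager; a real backend would lower the graph to TensorRT, custom kernels, etc.

```python
import torch
from typing import List

def my_backend(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
    # Inspect the captured graph, then return a callable with the same signature.
    print(gm.graph)
    return gm.forward  # eager fallback; a real backend would return compiled code

@torch.compile(backend=my_backend)
def f(x):
    return torch.relu(x) + 1

f(torch.randn(8))
```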
Low precision
 Google's report on Gemma 2
 Initial takeaway: many tricks focus on training stability, particularly suited to low-precision scenarios, e.g., logit soft-capping and sandwich layer normalization (a tiny soft-capping sketch follows this list). Does this hint at int8 training being crucial?
 Effective Interplay between Sparsity and Quantization: From Theory to Practice
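The soft-capping sketch referenced above: logits are squashed smoothly into (-cap, cap) with tanh so no single logit can blow up the loss or its gradients (the Gemma 2 report uses a cap of 30 for final logits and 50 for attention logits):

```python
import torch

def softcap(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # tanh keeps the mapping smooth and monotonic, unlike a hard clamp.
    return cap * torch.tanh(logits / cap)

x = torch.randn(4, 32) * 100
print(x.abs().max().item(), softcap(x).abs().max().item())  # capped values stay below 30
```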
Inference
 From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
 Seems potentially useful for compressing the kv cache or developing alternative methods: https://arxiv.org/abs/2404.15574
Scaling
 Complete overview of MuP
 Simo Ryu's guide to scaling from a small-scale proxy https://cloneofsimo.notion.site/What-to-do-to-scale-up-09e469d7c3444d6a90305397c38a46f5
 Our 12 scaling laws (for LLM knowledge capacity) are out: https://arxiv.org/abs/2404.05405. Took me 4mos to submit 50,000 jobs; took Meta 1mo for legal review; FAIR sponsored 4,200,000 GPU hrs. Hope this is a new direction to study scaling laws + help practitioners make informed decisions
 MiniCPM: Unveiling the Potential of End-side Large Language Models
 DeepSeek LLM report (section 3.1)
Scaling
 Mechanistic Design and Scaling of Hybrid Architectures
 Scaling Laws for Linear Complexity Language Models
 Compute Better Spent: Replacing Dense Layers with Structured Matrices
 An Empirical Study of Mamba-based Language Models
 Scaling laws with structured layers https://arxiv.org/pdf/2410.02117
Research
 Simo Ryu, list of research ideas: https://x.com/cloneofsimo/status/1807461666957013120
 Simo Ryu, list of insightful NN papers: https://github.com/cloneofsimo/insightful-nn-papers
 GiffMana thread on Distillation
 TTT
 Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models https://arxiv.org/abs/2410.11081
 ViT improvements in tokenization https://x.com/wenhaoli29/status/1846217454059389410?s=46