    Profiling

    Dec 16, 2024 · 1 min read

    • https://wizardzines.com/ (comic-style zines on various programming topics)

      • Profiling & tracing with perf
      • Linux debugging tools you’ll love
    • Machine Learning Engineering Open Book: https://github.com/stas00/ml-engineering

    • PyTorch benchmark recipe: https://pytorch.org/tutorials/recipes/recipes/benchmark.html (see the sketch below)
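
    A minimal sketch of timing a CUDA op with torch.utils.benchmark; the shapes and label here are illustrative. Timer handles warmup and CUDA synchronization for you, which naive time.time() loops get wrong:

    ```python
    import torch
    import torch.utils.benchmark as benchmark

    x = torch.randn(1024, 1024, device="cuda")  # assumes a CUDA device

    t = benchmark.Timer(
        stmt="x @ x",           # statement to time
        globals={"x": x},       # names visible to stmt
        label="matmul 1024x1024",
    )
    # blocked_autorange picks the number of iterations automatically.
    print(t.blocked_autorange(min_run_time=1))
    ```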

    • Nsight Compute for kernel profiling

      • Nsight Compute Profiling Guide
      • mcarilli/nsight.sh - Favorite nsight systems profiling commands for PyTorch scripts
      • Profiling GPU Applications with Nsight Systems
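
    A minimal sketch of the NVTX-annotation pattern those commands rely on (the model and shapes below are made up; the nsys flags follow the style of the gist above):

    ```python
    import torch

    # Run under Nsight Systems to see the named ranges on the timeline, e.g.:
    #   nsys profile -t cuda,nvtx -o my_report python this_script.py
    model = torch.nn.Linear(4096, 4096).cuda()
    x = torch.randn(64, 4096, device="cuda")

    for step in range(3):
        model.zero_grad(set_to_none=True)
        torch.cuda.nvtx.range_push(f"step_{step}")
        torch.cuda.nvtx.range_push("forward")
        y = model(x)
        torch.cuda.nvtx.range_pop()
        torch.cuda.nvtx.range_push("backward")
        y.sum().backward()
        torch.cuda.nvtx.range_pop()
        torch.cuda.nvtx.range_pop()  # close the step range
    ```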
    • Lecture 1: How to profile CUDA kernels in PyTorch (see the torch.profiler sketch below)
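
    A minimal torch.profiler example in the spirit of that lecture (module and shapes are illustrative):

    ```python
    import torch
    from torch.profiler import profile, ProfilerActivity

    model = torch.nn.Linear(4096, 4096).cuda()
    x = torch.randn(64, 4096, device="cuda")

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
    ) as prof:
        for _ in range(5):
            model(x).sum().backward()

    # Sort by total CUDA time to surface the heaviest kernels.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
    # prof.export_chrome_trace("trace.json")  # view in Perfetto / chrome://tracing
    ```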

    • https://arthurchiao.art/blog/understanding-gpu-performance/

    • There are multiple formulas for computing MFU & HFU; they are more realistic than nvidia-smi and also tell you how much performance you can still squeeze out of the GPUs. The formula from the PaLM paper is the reference most projects use (Megatron, nanoGPT, …). I'm using the torchtitan implementation: https://github.com/pytorch/torchtitan/blob/b0ed7f075921357b01e28fddc6d90a2cc410bab3/torchtitan/utils.py#L123. See also https://github.com/pytorch/torchtitan/blob/b0ed7f075921357b01e28fddc6d90a2cc410bab3/train.py#L224 and https://github.com/pytorch/torchtitan/blob/b0ed7f075921357b01e28fddc6d90a2cc410bab3/train.py#L434, plus the interesting discussion in https://github.com/pytorch/torchtitan/pull/280. A sketch of the PaLM-style formula follows below.
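
    A minimal sketch of the PaLM-style MFU calculation (the function and example numbers are illustrative, not torchtitan's actual code). Per PaLM (Appendix B), FLOPs per token for a decoder-only transformer are roughly 6*N for the parameter matmuls plus 12*L*H*Q*T for attention:

    ```python
    # N = params, L = layers, H = heads, Q = head dim, T = sequence length.
    def mfu(tokens_per_sec: float, num_params: float, n_layers: int,
            n_heads: int, head_dim: int, seq_len: int, peak_flops: float) -> float:
        flops_per_token = (6 * num_params
                           + 12 * n_layers * n_heads * head_dim * seq_len)
        return tokens_per_sec * flops_per_token / peak_flops

    # Illustrative numbers: a 7B model at 3000 tok/s per GPU on an A100
    # (312 TFLOPS bf16 peak).
    print(f"MFU: {mfu(3000, 7e9, 32, 32, 128, 4096, 312e12):.1%}")
    ```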

    • Benchmarking LLM Inference Backends: vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and TGI https://www.bentoml.com/blog/benchmarking-llm-inference-backends

    • py-spy: a sampling Python profiler with extremely low overhead; it needs no code modification and can attach to live production code. Just run py-spy top --pid <pid>

      • making it work on SLURM
    • Stress-testing an inference stack: https://x.com/stasbekman/status/1844924617980510675?s=46
