    Profiling

    Dec 16, 2024 · 1 min read

    • https://wizardzines.com/ (comic-style zines on various programming topics)

      • Profiling & tracing with perf
      • Linux debugging tools you’ll love
    • Machine Learning Engineering Open Book: https://github.com/stas00/ml-engineering

    • PyTorch benchmark recipe: https://pytorch.org/tutorials/recipes/recipes/benchmark.html (see the sketch below)
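
    A minimal sketch of timing a CUDA op with torch.utils.benchmark; the shapes and label here are illustrative. Timer handles warmup and CUDA synchronization for you, which naive time.time() loops get wrong:

    ```python
    import torch
    import torch.utils.benchmark as benchmark

    x = torch.randn(1024, 1024, device="cuda")  # assumes a CUDA device

    t = benchmark.Timer(
        stmt="x @ x",           # statement to time
        globals={"x": x},       # names visible to stmt
        label="matmul 1024x1024",
    )
    # blocked_autorange picks the number of iterations automatically.
    print(t.blocked_autorange(min_run_time=1))
    ```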

    • Nsight Compute for kernel profiling

      • Nsight Compute Profiling Guide
      • mcarilli/nsight.sh - Favorite nsight systems profiling commands for PyTorch scripts
      • Profiling GPU Applications with Nsight Systems
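
    A minimal sketch of the NVTX-annotation pattern those commands rely on (the model and shapes below are made up; the nsys flags follow the style of the gist above):

    ```python
    import torch

    # Run under Nsight Systems to see the named ranges on the timeline, e.g.:
    #   nsys profile -t cuda,nvtx -o my_report python this_script.py
    model = torch.nn.Linear(4096, 4096).cuda()
    x = torch.randn(64, 4096, device="cuda")

    for step in range(3):
        model.zero_grad(set_to_none=True)
        torch.cuda.nvtx.range_push(f"step_{step}")
        torch.cuda.nvtx.range_push("forward")
        y = model(x)
        torch.cuda.nvtx.range_pop()
        torch.cuda.nvtx.range_push("backward")
        y.sum().backward()
        torch.cuda.nvtx.range_pop()
        torch.cuda.nvtx.range_pop()  # close the step range
    ```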
    • Lecture 1: How to profile CUDA kernels in PyTorch (see the torch.profiler sketch below)
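
    A minimal torch.profiler example in the spirit of that lecture (module and shapes are illustrative):

    ```python
    import torch
    from torch.profiler import profile, ProfilerActivity

    model = torch.nn.Linear(4096, 4096).cuda()
    x = torch.randn(64, 4096, device="cuda")

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
    ) as prof:
        for _ in range(5):
            model(x).sum().backward()

    # Sort by total CUDA time to surface the heaviest kernels.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
    # prof.export_chrome_trace("trace.json")  # view in Perfetto / chrome://tracing
    ```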

    • https://arthurchiao.art/blog/understanding-gpu-performance/

    • There are multiple formulas for computing MFU & HFU; they are more realistic than nvidia-smi and also tell you how much performance you can still squeeze out of the GPUs. The formula from the PaLM paper is the reference most projects use (Megatron, nanoGPT, …). I'm using the torchtitan implementation: https://github.com/pytorch/torchtitan/blob/b0ed7f075921357b01e28fddc6d90a2cc410bab3/torchtitan/utils.py#L123. See also https://github.com/pytorch/torchtitan/blob/b0ed7f075921357b01e28fddc6d90a2cc410bab3/train.py#L224 and https://github.com/pytorch/torchtitan/blob/b0ed7f075921357b01e28fddc6d90a2cc410bab3/train.py#L434, plus the interesting discussion in https://github.com/pytorch/torchtitan/pull/280. A sketch of the PaLM-style formula follows below.
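
    A minimal sketch of the PaLM-style MFU calculation (the function and example numbers are illustrative, not torchtitan's actual code). Per PaLM (Appendix B), FLOPs per token for a decoder-only transformer are roughly 6*N for the parameter matmuls plus 12*L*H*Q*T for attention:

    ```python
    # N = params, L = layers, H = heads, Q = head dim, T = sequence length.
    def mfu(tokens_per_sec: float, num_params: float, n_layers: int,
            n_heads: int, head_dim: int, seq_len: int, peak_flops: float) -> float:
        flops_per_token = (6 * num_params
                           + 12 * n_layers * n_heads * head_dim * seq_len)
        return tokens_per_sec * flops_per_token / peak_flops

    # Illustrative numbers: a 7B model at 3000 tok/s per GPU on an A100
    # (312 TFLOPS bf16 peak).
    print(f"MFU: {mfu(3000, 7e9, 32, 32, 128, 4096, 312e12):.1%}")
    ```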

    • Benchmarking LLM Inference Backends: vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and TGI https://www.bentoml.com/blog/benchmarking-llm-inference-backends

    • py-spy: a sampling Python profiler with extremely low overhead; it needs no code modification and can attach to live production code. Just run py-spy top --pid <pid>

      • making it work on SLURM
    • Stress-testing an inference stack: https://x.com/stasbekman/status/1844924617980510675?s=46
