🤖 Harold's Notes
Compiler stack
Jul 03, 2024
Source: https://www.youtube.com/watch?v=7t2cCmILRQs