    Scaling

    Dec 16, 2024 · 1 min read

    • Simo Ryu’s guide to scaling up from a small-scale proxy: https://cloneofsimo.notion.site/What-to-do-to-scale-up-09e469d7c3444d6a90305397c38a46f5

    • Scaling Book - A Systems View of LLMs on TPUs (very good read)

    • Google’s report on Gemma 2

      • Initial takeaway: many of the tricks target training stability and seem particularly suited to low-precision settings, e.g., logit soft-capping and sandwich layer normalization (a rough sketch of both follows below). Does this hint at int8 training becoming crucial?
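
    A minimal PyTorch sketch of those two tricks, to make the idea concrete. The cap values below are the ones reported for Gemma 2; the `SandwichBlock` wrapper and the exact placement of the norms are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Smoothly squash logits into (-cap, cap) with tanh; keeps values and
    # gradients bounded, which helps when training in low precision.
    return cap * torch.tanh(logits / cap)

# Caps reported for Gemma 2: 50.0 on attention logits, 30.0 on final logits.
attn_logits = soft_cap(torch.randn(2, 8, 64, 64), cap=50.0)
final_logits = soft_cap(torch.randn(2, 64, 32_000), cap=30.0)

class SandwichBlock(nn.Module):
    # Sandwich layer normalization (illustrative wrapper, not Gemma 2's code):
    # normalize the sub-layer's input *and* its output before the residual add.
    # nn.RMSNorm requires PyTorch >= 2.4.
    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.pre_norm = nn.RMSNorm(dim)
        self.post_norm = nn.RMSNorm(dim)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.post_norm(self.sublayer(self.pre_norm(x)))
```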

    Other architectures
