Harold's Notes


    Architecture

    Jan 03, 2025 · 1 min read

    • https://wandb.ai/dalle-mini/dalle-mini/reports/An-Evaluation-of-Transformer-Variants--VmlldzoxNjk4MTIw: experiments on tweaks to the transformer architecture for text-to-image generation.

    • Very nice summary of model architectures by Songlin Yang: https://sustcsonglin.github.io/assets/pdf/talk_250117.pdf

    • Gated Delta Networks: Improving Mamba2 with Delta Rule (a rough sketch of the gated delta-rule update is at the end of this list)

    • Transformer improvement Thread

      • Reformer: reversible layers bring activation memory from O(N) in the number of layers down to O(1), with the same convergence (see the sketch after this thread). Paper: https://arxiv.org/abs/2001.04451
      • MLP-Mixer: outscales transformers in the high-data regime while using a fraction of the parameters and runtime (2x with full-context L3.1-405B; 9x with 8192 tokens). Paper: https://arxiv.org/abs/2105.08050
      • WideNet: MoE uses too much memory, so why not share the experts across layers? WideNet reduces memory, and the thread reports 2x end-to-end training speedups. Paper: https://arxiv.org/abs/2107.11817
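
    As a quick illustration of the reversible-layer trick above: a minimal RevNet-style block of the kind Reformer builds on (an illustrative PyTorch sketch, not Reformer's actual implementation; the F/G sub-layers are placeholders). Because the inputs can be reconstructed exactly from the outputs, activations can be recomputed during the backward pass instead of being stored once per layer.

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """RevNet-style reversible residual block (the mechanism behind Reformer's memory saving).

    Forward:  y1 = x1 + F(x2),  y2 = x2 + G(y1)
    Since (x1, x2) can be recovered exactly from (y1, y2), a stack of these blocks
    only ever needs one set of activations: O(1) in depth instead of O(N).
    """

    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f = f  # e.g. an attention sub-layer
        self.g = g  # e.g. a feed-forward sub-layer

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Reconstruct the inputs from the outputs: nothing has to be stored.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2


# Quick invertibility check with toy sub-layers.
d = 16
block = ReversibleBlock(nn.Linear(d, d), nn.Linear(d, d))
x1, x2 = torch.randn(3, d), torch.randn(3, d)
y1, y2 = block(x1, x2)
rx1, rx2 = block.inverse(y1, y2)
print(torch.allclose(rx1, x1, atol=1e-5), torch.allclose(rx2, x2, atol=1e-5))
```

    To actually save memory in training, the recomputation has to be hooked into autograd (e.g. a custom torch.autograd.Function that calls inverse() during backward); the snippet only shows why that is possible.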
    • Mechanistic Design and Scaling of Hybrid Architectures

    • Scaling Laws for Linear Complexity Language Models

    • Compute Better Spent: Replacing Dense Layers with Structured Matrices

    • An Empirical Study of Mamba-based Language Models

    • Scaling laws with structured layers https://arxiv.org/pdf/2410.02117
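
    The Gated Delta Networks entry above is only a title, so here is a naive sequential sketch of the gated delta-rule recurrence as I read it from the paper (the function name, shapes, and this per-step loop are illustrative; the paper trains with a chunked, hardware-efficient parallel form):

```python
import torch

def gated_delta_rule(q, k, v, alpha, beta):
    """Naive per-step gated delta-rule recurrence for a single head.

    q, k, v : (T, d) query / key / value sequences
    alpha   : (T,)   decay gate in (0, 1)      -- the Mamba2-style "gated" part
    beta    : (T,)   write strength in (0, 1)  -- the delta-rule learning rate

    State update, with a fast-weight matrix S of shape (d, d):
        S_t = S_{t-1} @ (alpha_t * (I - beta_t * k_t k_t^T)) + beta_t * v_t k_t^T
        o_t = S_t @ q_t
    """
    T, d = q.shape
    S = torch.zeros(d, d)
    I = torch.eye(d)
    outputs = []
    for t in range(T):
        kt, vt, qt = k[t], v[t], q[t]
        # decay the old state, erase the component along k_t, then write the new (v_t, k_t) association
        S = S @ (alpha[t] * (I - beta[t] * torch.outer(kt, kt))) + beta[t] * torch.outer(vt, kt)
        outputs.append(S @ qt)
    return torch.stack(outputs)


# Toy usage with random inputs.
T, d = 8, 4
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
alpha = 0.9 + 0.1 * torch.rand(T)   # slow decay
beta = torch.rand(T)
print(gated_delta_rule(q, k, v, alpha, beta).shape)  # torch.Size([8, 4])
```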
