Engineering
General

Talk from Modal Labs CTO about data infrastructure, Docker, … https://www.youtube.com/watch?v=3jJ1GhGkLY0

https://wizardzines.com/ (comic books for different programming stuff)
ML

torch.compile, the missing manual, https://docs.google.com/document/d/1y5CRfMLdwEoF1nTk9q8qEu1mgMUuUtvhklPKJ2emLU8/edit#heading=h.ivdr7fmrbeab

Universal Checkpointing https://x.com/stasbekman/status/1808287880781127930?s=12

Dirty PoC on how to start multiple vLLM OpenAI-compatible servers with n-way DP and m-way TP on n*m GPUs for maximum generation throughput (sketch below). https://x.com/_philschmid/status/1807518758736728451?s=12
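A minimal sketch of the same idea (model name, ports, and GPU counts are illustrative): one vLLM OpenAI-compatible server per data-parallel replica, each pinned to its own GPU subset via CUDA_VISIBLE_DEVICES, with a load balancer in front.

```python
import os
import subprocess

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative model name
n_dp, m_tp = 2, 2  # 2 replicas x 2-way tensor parallel = 4 GPUs total

procs = []
for rank in range(n_dp):
    # Pin each replica to its own disjoint set of m_tp GPUs and its own port.
    gpus = ",".join(str(rank * m_tp + g) for g in range(m_tp))
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": gpus}
    procs.append(subprocess.Popen(
        ["python", "-m", "vllm.entrypoints.openai.api_server",
         "--model", MODEL,
         "--tensor-parallel-size", str(m_tp),
         "--port", str(8000 + rank)],
        env=env,
    ))

# A separate load balancer (nginx, a round-robin client, ...) spreads requests
# over ports 8000..8000+n_dp-1 to get the data-parallel throughput.
for p in procs:
    p.wait()
```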

Unsurprisingly, lots of papers at this year's USENIX re: AI/LLM training and inference infra. USENIX/OSDI are pure systems & infrastructure conferences, not as theoretical as NeurIPS/ICLR/ICML/etc., so it should be fun. https://t.co/CQpd11NQPx
Infra profiling

https://github.com/stas00/ml-engineering (Machine Learning Engineering Open Book)

PyTorch Benchmark https://pytorch.org/tutorials/recipes/recipes/benchmark.html
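The core of that recipe is torch.utils.benchmark.Timer, which handles the warmup and CUDA synchronization that naive time.time() loops get wrong (shapes below are illustrative):

```python
import torch
import torch.utils.benchmark as benchmark

x = torch.randn(4096, 4096, device="cuda")
y = torch.randn(4096, 4096, device="cuda")

# Timer synchronizes CUDA and does a warmup run before measuring.
t = benchmark.Timer(stmt="x @ y", globals={"x": x, "y": y})
print(t.timeit(100))  # run the statement 100 times and print the measurement
```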

Nsight Compute for kernel profiling
 Nsight Compute Profiling Guide
 mcarilli/nsight.sh  Favorite nsight systems profiling commands for PyTorch scripts
 Profiling GPU Applications with Nsight Systems

There are multiple formulas for computing MFU & HFU, which are more realistic than nvidia-smi utilization and also tell you how much performance you can still squeeze out of the GPUs. The reference most projects use (Megatron, nanoGPT, …) is the formula from the PaLM paper (sketch below). I'm using the implementation in https://github.com/pytorch/torchtitan/blob/b0ed7f075921357b01e28fddc6d90a2cc410bab3/torchtitan/utils.py#L123; check https://github.com/pytorch/torchtitan/blob/b0ed7f075921357b01e28fddc6d90a2cc410bab3/train.py#L224 and https://github.com/pytorch/torchtitan/blob/b0ed7f075921357b01e28fddc6d90a2cc410bab3/train.py#L434 too. The discussion in https://github.com/pytorch/torchtitan/pull/280 is also interesting.
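A minimal sketch of the PaLM-style MFU formula that the torchtitan code above implements (parameter names and the H100 peak number are illustrative; plug in your own GPU's peak BF16 FLOPS):

```python
def model_flops_utilization(tokens_per_sec: float, num_params: int, num_layers: int,
                            num_heads: int, head_dim: int, seq_len: int,
                            peak_flops_per_gpu: float = 989e12,  # H100 dense BF16
                            num_gpus: int = 1) -> float:
    # PaLM-style estimate: ~6*N FLOPs per token for the dense matmuls (fwd + bwd),
    # plus 12*L*H*Q*T FLOPs per token for attention over a sequence of length T.
    flops_per_token = 6 * num_params + 12 * num_layers * num_heads * head_dim * seq_len
    achieved = tokens_per_sec * flops_per_token
    return achieved / (peak_flops_per_gpu * num_gpus)
```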

Benchmarking LLM Inference Backends: vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and TGI https://www.bentoml.com/blog/benchmarking-llm-inference-backends

py-spy Python profiler: extremely low overhead, no code modification, can run against live production code; just run
py-spy top --pid <pid>
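It can also record a flamegraph of a running process without stopping it:
py-spy record -o profile.svg --pid <pid>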

stress test inference stack https://x.com/stasbekman/status/1844924617980510675?s=46
GPU concepts
 Explanations of GPUs https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/#Memory_Bandwidth
 Understanding NVIDIA GPU Performance: Utilization vs. Saturation (2023)
GPU programming
Beginner

Getting Started With CUDA for Python Programmers (jeremy howard)

Triton programming for Mamba https://srush.github.io/annotated-mamba/hard.html

Colfax tutorials (FA3 authors): https://research.colfax-intl.com
Advanced

Walkthrough of the Parallel Scan with Cuda

How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog

Flash Attention implementation that's just 100 lines of code and 30% faster
 ThunderKittens (TK), a simple DSL embedded within CUDA that makes it easy to express key technical ideas for building AI kernels. TK lets us write clean, easy-to-understand code that maximizes GPU utilization, on all kinds of kernels!

Kernel for fast 2:4 sparsification (50% of zeros), and it is an order of magnitude faster than alternatives. When sparsifying the weights, it makes linear layers 30% faster when considering the FW+BW passes
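For flavor, a hedged sketch of the same idea using PyTorch's prototype semi-structured (2:4) sparsity API rather than the kernel the post refers to; the pruning rule here is a naive magnitude-based 2-of-4 mask purely for illustration, and it assumes fp16/bf16 weights on an Ampere+ GPU:

```python
import torch
from torch import nn
from torch.sparse import to_sparse_semi_structured

linear = nn.Linear(4096, 4096, bias=False).half().cuda()

# Naive 2:4 mask: keep the 2 largest-magnitude weights in every group of 4.
w = linear.weight.detach()
groups = w.abs().reshape(-1, 4)
mask = torch.zeros_like(groups, dtype=torch.bool)
mask.scatter_(1, groups.topk(2, dim=1).indices, True)

# Store the pruned weight in the compressed 2:4 format so the linear layer
# dispatches to the sparse matmul kernels (and uses ~half the weight memory).
linear.weight = nn.Parameter(to_sparse_semi_structured(w * mask.reshape_as(w)))

x = torch.randn(128, 4096, dtype=torch.float16, device="cuda")
out = linear(x)
```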

Fused RMS Norm in Triton https://github.com/pytorch-labs/applied-ai/blob/main/kernels/triton/training/rms_norm/fused_rms_norm.py
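For context, the unfused reference that such a kernel replaces is only a few lines of PyTorch; the Triton version fuses the reduction, rsqrt, and scaling into one kernel to avoid extra trips through HBM:

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Normalize by the root mean square over the last dimension, then scale.
    rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x * rms * weight
```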
Parallel programming (scans, …)
 Parallelizing Complex Scans and Reductions
 method for automatically extracting parallel prefix programs from sequential loops, even in the presence of complicated conditional statements (a toy scan sketch follows this list).
 Parallelizing nonlinear sequential models over the sequence length
 https://x.com/christopher/status/1811406837675163998?s=46
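The toy scan sketch referenced above: a Hillis-Steele style inclusive scan that does O(log n) vectorized passes for any associative op, which is the basic pattern these papers generalize (numpy stands in for actual parallel hardware):

```python
import numpy as np

def inclusive_scan(x: np.ndarray, op=np.add) -> np.ndarray:
    # Each pass combines every element with the one `offset` positions to its
    # left; after log2(n) passes every prefix has been accumulated.
    x = x.copy()
    offset = 1
    while offset < len(x):
        x[offset:] = op(x[offset:], x[:-offset])
        offset *= 2
    return x

assert np.array_equal(inclusive_scan(np.arange(1, 9)), np.cumsum(np.arange(1, 9)))
```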
nvmath-python
is a new @nvidia library that enables pythonic access to accelerated math ops from CUDA-X Math. It's currently in beta and includes ops for linear algebra and fast Fourier transforms.
Sequence Parallelism / Long context
 Linear Attention Sequence Parallelism
 LASP scales sequence length up to 4096K using 128 A100 80G GPUs on 1B models, which is 8 times longer than existing SP methods while being significantly faster.
 Ring Attention
 Striped Attention
 BurstAttention
Training

Megatron blog post (scatter-gather optimization, performance microbenchmarks for pipeline parallelism, …)

GSPMD: General and Scalable Parallelization for ML Computation Graphs
 GSPMD is now the fundamental component of JAX/TensorFlow distributed training and enables various optimizations with the XLA compiler to allow users to train their models efficiently in a large scale setting.
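A minimal JAX sketch of the user-facing side of this (axis names and shapes are illustrative, and the batch size is assumed to divide the device count): you annotate a few arrays with shardings over a device mesh, and the XLA/GSPMD partitioner propagates the partitioning through the jitted computation.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# 1D device mesh with a single "data" axis.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

# Shard the batch dimension across "data"; replicate the weight.
x = jax.device_put(jnp.ones((64, 512)), NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((512, 128)), NamedSharding(mesh, P(None, None)))

@jax.jit
def forward(x, w):
    # GSPMD decides how to partition this matmul from the operand shardings.
    return jnp.tanh(x @ w)

print(forward(x, w).sharding)  # output stays sharded along the batch axis
```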
Inference

Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

Nice blog on LLM inference optimizations: https://vgel.me/posts/faster-inference/

Cut LLM costs by mixing GPU types

Tim Dettmers on quantization https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/

SliceGPT: Weight Matrix Compression for LLMs https://huggingface.co/papers/2401.15024

vLLM FP8 support https://x.com/anyscalecompute/status/1811059148911693906?s=46

vLLM office hours / videos https://neuralmagic.com/community-office-hours/

SGLang (used by xAI people for Grok-mini) https://github.com/sgl-project/sglang

You can build a custom TorchDynamo backend for super-efficient inference (minimal sketch below)
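A minimal sketch of what that looks like: the backend receives the captured FX graph plus example inputs and can hand it to whatever compiler or runtime you like. Here it just prints the graph and falls back to eager; a real backend would lower the graph to TensorRT, custom kernels, etc.

```python
import torch
from typing import List

def my_backend(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
    # Inspect the captured graph, then return a callable with the same signature.
    print(gm.graph)
    return gm.forward  # eager fallback; a real backend would return compiled code

@torch.compile(backend=my_backend)
def f(x):
    return torch.relu(x) + 1

f(torch.randn(8))
```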
Low precision
 Google's report on Gemma 2
 Initial takeaway: many tricks focus on training stability, particularly suited to low-precision scenarios, e.g., logit soft-capping and sandwich layer normalization (a tiny soft-capping sketch follows this list). Does this hint at int8 training being crucial?
 Effective Interplay between Sparsity and Quantization: From Theory to Practice
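The soft-capping sketch referenced above: logits are squashed smoothly into (-cap, cap) with tanh so no single logit can blow up the loss or its gradients (the Gemma 2 report uses a cap of 30 for final logits and 50 for attention logits):

```python
import torch

def softcap(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # tanh keeps the mapping smooth and monotonic, unlike a hard clamp.
    return cap * torch.tanh(logits / cap)

x = torch.randn(4, 32) * 100
print(x.abs().max().item(), softcap(x).abs().max().item())  # capped values stay below 30
```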
Inference
 From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
 Seems potentially useful for compressing the kv cache or developing alternative methods: https://arxiv.org/abs/2404.15574
Scaling
 Complete overview of MuP
 Simo Ryu's guide to scaling from a small-scale proxy https://cloneofsimo.notion.site/What-to-do-to-scale-up-09e469d7c3444d6a90305397c38a46f5
 Our 12 scaling laws (for LLM knowledge capacity) are out: https://arxiv.org/abs/2404.05405. Took me 4mos to submit 50,000 jobs; took Meta 1mo for legal review; FAIR sponsored 4,200,000 GPU hrs. Hope this is a new direction to study scaling laws + help practitioners make informed decisions
 MiniCPM: Unveiling the Potential of End-side Large Language Models
 DeepSeek LLM report (section 3.1)
Scaling
 Mechanistic Design and Scaling of Hybrid Architectures
 Scaling Laws for Linear Complexity Language Models
 Compute Better Spent: Replacing Dense Layers with Structured Matrices
 An Empirical Study of Mamba-based Language Models
 Scaling laws with structured layers https://arxiv.org/pdf/2410.02117
Research
 Simo Ryu, list of research ideas: https://x.com/cloneofsimo/status/1807461666957013120
 Simo Ryu, list of insightful NN papers: https://github.com/cloneofsimo/insightful-nn-papers
 GiffMana thread on Distillation
 TTT
 Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models https://arxiv.org/abs/2410.11081
 ViT improvements in tokenization https://x.com/wenhaoli29/status/1846217454059389410?s=46