Scaling Laws - rules we expect to hold

  • Scaling Laws for Batch Size
    • The optimal batch size is proportional to the gradient noise scale (the sum of the variances of the individual gradient components, divided by the squared norm of the gradient). This tends to be large for noisy RL environments, and smaller for well-defined tasks (a small estimation sketch follows this list).
  • Scaling Laws for Neural Language Models
    • Larger models are significantly more sample-efficient
      • such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence
    • Performance depends strongly on scale, weakly on model shape
    • Universality of training: Training curves follow predictable power laws, whose parameters are roughly independent of model size
      • By extrapolating the early part of the curve, we can predict the loss we would reach if we trained much longer (see the curve-fitting sketch below)
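
A minimal sketch of estimating the simple gradient noise scale B_simple = tr(Σ) / |G|² from per-example gradients (the function name, tensor layout, and the direct per-example estimate are assumptions for illustration; in practice it is usually estimated from gradient norms measured at two different batch sizes):

```python
import torch

def simple_noise_scale(per_example_grads: torch.Tensor) -> float:
    """Estimate B_simple = tr(Sigma) / |G|^2.

    per_example_grads: (batch, num_params) tensor of flattened per-example gradients.
    Illustrative only; real implementations avoid materializing per-example gradients.
    """
    mean_grad = per_example_grads.mean(dim=0)         # estimate of the true gradient G
    per_component_var = per_example_grads.var(dim=0)  # variance of each gradient component
    return (per_component_var.sum() / mean_grad.pow(2).sum()).item()
```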

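And a sketch of the "extrapolate the early curve" idea, assuming a power-law-plus-constant form L(S) = L_inf + a·S^(−b) in the spirit of the scaling-law papers; the data points below are made up for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(steps, l_inf, a, b):
    # L(S) = L_inf + a * S^(-b): irreducible loss plus a power-law decay term
    return l_inf + a * np.power(steps, -b)

# Fit on the early part of the training curve (illustrative data), then extrapolate.
steps = np.array([1e3, 2e3, 5e3, 1e4, 2e4, 5e4])
losses = np.array([4.2, 3.9, 3.5, 3.2, 3.0, 2.8])

params, _ = curve_fit(power_law, steps, losses, p0=(2.0, 50.0, 0.5), maxfev=10000)
predicted_loss_at_1m_steps = power_law(1e6, *params)
```
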
More parameters

  • Why: better performance
  • Problems: instabilities (loss spikes, brittle hyperparameter tuning)
  • Solution:
    • qk-layernorm, z-loss (tricks to reduce instabilities; see the sketch after this list)
      • they keep the attention logits and the output logits, respectively, from diverging
    • mu-Transfer, which gives us
      • feature learning stability (i.e. the “typical element size” of activation vectors stays of the same order with respect to width)
      • hyperparameter stability, which ensures that the optimal HPs found on small models remain (near-)optimal as the model size grows (a rough sketch follows below)
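
A minimal sketch of the two stabilization tricks, assuming a standard attention block (shapes and the z-loss coefficient are illustrative): QK-layernorm normalizes queries and keys before the dot product, and the z-loss penalizes the log of the softmax normalizer of the output logits.

```python
import torch
import torch.nn.functional as F

def qk_layernorm_attention(q, k, v, eps=1e-6):
    # Normalize queries and keys per head before the dot product, so the
    # attention logits cannot blow up as activations grow.
    q = F.layer_norm(q, q.shape[-1:], eps=eps)
    k = F.layer_norm(k, k.shape[-1:], eps=eps)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def z_loss(logits, coef=1e-4):
    # Auxiliary loss penalizing log(Z)^2, where Z is the softmax normalizer of
    # the output logits; discourages the logits from drifting to large values.
    log_z = torch.logsumexp(logits, dim=-1)
    return coef * (log_z ** 2).mean()
```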

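A rough, heavily simplified sketch of the μP idea for Adam: hidden weight matrices get their learning rate scaled down with width, while embeddings, biases and norms keep the base learning rate. The parameter filter and the exact scaling rules depend on the architecture and the chosen parameterization, so treat this as illustrative only:

```python
import torch

def mup_adam_param_groups(model, base_lr, base_width, width):
    """Illustrative muP-style Adam groups: hidden matrices get lr scaled by
    base_width / width, everything else (embeddings, biases, norms) keeps base_lr."""
    hidden, other = [], []
    for name, p in model.named_parameters():
        # Assumption: 2D weights outside the embedding are the ones whose
        # fan-in grows with width; adjust the filter for your architecture.
        (hidden if (p.ndim == 2 and "embed" not in name) else other).append(p)
    return torch.optim.Adam([
        {"params": hidden, "lr": base_lr * base_width / width},
        {"params": other, "lr": base_lr},
    ])
```
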
More FLOPs

  • Why: more throughput
    • for a fixed model size, increasing the number of effective FLOPs per second directly translates to more tokens processed per second, i.e. a better model for a fixed amount of GPU hours

Solution: FP8

  • theoretical increase in FLOPs over bf16: 2x

  • Problems: numerical instability

  • Solution:

    • per-tensor scaling (FP8-LM: Training FP8 Large Language Models); see the sketch after this list
    • u-mup, which enforces a variance of 1 for both activations and gradients (ensuring that floating-point values during training stay within the representable range of a given number format)
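
A minimal sketch of per-tensor scaling, assuming the e4m3 format (max finite value 448) and recent PyTorch float8 dtypes; this illustrates the idea, it is not the FP8-LM implementation:

```python
import torch

E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3

def to_fp8_e4m3(t: torch.Tensor):
    # Per-tensor scaling: pick a scale so the absolute max maps to the format's
    # max, cast, and return the scale so the result can be de-scaled later.
    scale = E4M3_MAX / t.abs().max().clamp(min=1e-12)
    t_fp8 = (t * scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
    return t_fp8, scale

def fp8_matmul(a, b):
    a8, sa = to_fp8_e4m3(a)
    b8, sb = to_fp8_e4m3(b)
    # Upcast for this reference computation; real FP8 kernels (e.g. torch._scaled_mm)
    # multiply in FP8 and apply the inverse scales to the accumulator.
    return (a8.to(torch.bfloat16) @ b8.to(torch.bfloat16)) / (sa * sb)
```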

Solution: Kernels

  • Writing custom kernels allows for better GPU utilization; the optimization work usually focuses on reducing memory transfers, as these are the main bottleneck for GPU computation (see the sketch after this list)
    • leverage hardware-specific characteristics (e.g. Hopper TMA)
    • fusing multiple operations together to limit memory transfers (e.g. adding residuals, scaling)
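
A small Triton sketch of the kind of fusion meant here: the residual add and the scaling happen in one kernel, so the intermediate sum never makes a round trip through global memory (names and block size are illustrative):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_residual_scale_kernel(x_ptr, res_ptr, out_ptr, scale, n_elements,
                                BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    res = tl.load(res_ptr + offsets, mask=mask)
    # Fused: residual add + scaling in one pass, one read per input, one write.
    tl.store(out_ptr + offsets, (x + res) * scale, mask=mask)

def fused_residual_scale(x, res, scale):
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    fused_residual_scale_kernel[grid](x, res, out, scale, n, BLOCK_SIZE=1024)
    return out
```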

Fewer FLOPs for the same capacity

  • Why: similar to above, for a fixed model capacity, reducing the FLOPs required per token directly translates to more tokens processed per second

Solution: Adaptive computation

  • “Adaptive width”
    • Mixture of Experts (MoE)
    • reduces the number of FLOPs that the feed-forward block needs to spend to process a given token (see the sketch at the end of this section)
    • validated at scale
  • “Adaptive depth”
    • Mixture-of-Depths paper (tokens may skip some layers)
    • not validated at scale (if DeepMind published it, it’s probably because it doesn’t work at scale, at least in its current form)
  • “Adaptive modality”
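
A compact sketch of top-k MoE routing, the “adaptive width” case above: each token is dispatched to only k of the num_experts feed-forward experts, so per-token FLOPs scale with k rather than with the total number of experts (simplified: no capacity limits, no load-balancing loss; class and argument names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model, d_ff, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)
        weights, idx = gates.topk(self.k, dim=-1)          # route each token to k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (idx == e).nonzero(as_tuple=True)
            if token_idx.numel():
                # Only the tokens routed to expert e pay its FFN FLOPs.
                out[token_idx] += weights[token_idx, slot, None] * expert(x[token_idx])
        return out
```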