Scaling Laws: rules we expect to hold
 Scaling Laws for Batch Size
 The optimal batch size is proportional to the gradient noise scale (the sum of the variances of the individual gradient components, divided by the squared norm of the global gradient). It tends to be large for noisy RL environments and smaller for well-defined tasks.
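A minimal sketch of that estimate, assuming per-example gradients are already flattened into a matrix (the function name is illustrative; the paper's full estimator instead compares gradient norms at two batch sizes):

```python
import torch

def simple_noise_scale(per_example_grads: torch.Tensor) -> torch.Tensor:
    """B_simple = tr(Sigma) / |G|^2: sum of per-component gradient variances
    over the squared norm of the mean gradient.

    per_example_grads: (batch, n_params), one flattened gradient per example.
    """
    mean_grad = per_example_grads.mean(dim=0)          # G, estimate of the true gradient
    trace_sigma = per_example_grads.var(dim=0).sum()   # tr(Sigma), summed variances
    return trace_sigma / mean_grad.pow(2).sum()
```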
 Scaling Laws for Neural Language Models
 Larger models are significantly more sample-efficient,
 such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence
 Performance depends strongly on scale, weakly on model shape
 Universality of training: Training curves follow predictable power laws, whose parameters are roughly independent of model size
 By extrapolating the early part of the curve, we can predict the loss we would reach if we trained much longer
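A hedged sketch of that extrapolation, fitting the saturating power law $L(S) = L_{\infty} + a \cdot S^{-b}$ to the early part of a loss curve; the functional form follows the paper, and the synthetic data stands in for whatever you logged:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(steps, l_inf, a, b):
    """L(S) = L_inf + a * S^(-b): the predictable shape of training curves."""
    return l_inf + a * steps ** (-b)

# Synthetic stand-in for the early part of a logged training curve.
steps = np.arange(100, 5_000, 100, dtype=float)
losses = power_law(steps, 1.8, 12.0, 0.35) + np.random.normal(0.0, 0.01, steps.size)

# Fit on the early curve, then extrapolate far beyond the observed steps.
(l_inf, a, b), _ = curve_fit(power_law, steps, losses, p0=(1.0, 1.0, 0.5))
print(f"predicted loss at 1M steps: {power_law(1e6, l_inf, a, b):.3f}")
```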
More parameters
 Why: better performance (larger models are more sample-efficient, per the scaling laws above)
 Problems: training instabilities (loss spikes) and hyperparameter tuning that must be redone at every scale
 Solution:
 QK-layernorm, z-loss (tricks to reduce instabilities; see the sketch after this list)
 QK-layernorm keeps the attention logits, and z-loss the output logits, from diverging
 muTransfer, which gives us
 feature learning stability (i.e. the “typical element size” of the vectors $h_{\ell}(x)$ and $\Delta h_{\ell}(x)$ is $\Theta(1)$ with respect to width $n_{\ell}$)
 hyperparameter stability, which ensures that the optimal HPs found on small models remain near-optimal as the model size grows (see the learning-rate sketch after this list)
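A minimal PyTorch sketch of both stability tricks (module and function names are illustrative): QK-layernorm normalizes queries and keys before the attention logits are formed, and z-loss penalizes the log-partition function of the output softmax:

```python
import torch
import torch.nn.functional as F
from torch import nn

class QKNormAttention(nn.Module):
    """Single-head self-attention with QK-layernorm (illustrative)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        # Normalizing q and k bounds the attention logits, preventing
        # the logit growth associated with loss spikes.
        self.q_norm = nn.LayerNorm(d_model)
        self.k_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        q = self.q_norm(self.q_proj(x))
        k = self.k_norm(self.k_proj(x))
        v = self.v_proj(x)
        logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        return F.softmax(logits, dim=-1) @ v

def z_loss(logits: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    """Auxiliary loss on log Z = logsumexp(logits): keeps the output
    logits from drifting to large magnitudes."""
    return coeff * torch.logsumexp(logits, dim=-1).pow(2).mean()
```

And a deliberately simplified sketch of the Adam learning-rate rule behind muTransfer's hyperparameter stability (the full parametrization also adjusts initializations and treats embedding/output layers specially; this is only the headline rule):

```python
def mup_adam_param_groups(model: nn.Module, base_lr: float,
                          base_width: int, width: int) -> list[dict]:
    """Hidden (matrix-like) weights get base_lr / width_multiplier under Adam,
    so HPs tuned at base_width stay near-optimal at larger widths."""
    mult = width / base_width
    matrix_like = [p for p in model.parameters() if p.ndim >= 2]
    vector_like = [p for p in model.parameters() if p.ndim < 2]
    return [
        {"params": matrix_like, "lr": base_lr / mult},
        {"params": vector_like, "lr": base_lr},
    ]
```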
More FLOPs
 Why: more throughput
 for a fixed model size, increasing the number of effective FLOPs directly translates to more tokens processed per second ⇒ a better model for a fixed amount of GPU hours
Solution: FP8

theoretical increase in FLOPs over bf16: 2x

Problems: numerical instability

Solution:
 per-tensor scaling (FP8-LM: Training FP8 Large Language Models); see the sketch below
 u-μP (Unit-Scaled μP), enforcing a variance of 1 for both activations and gradients (this keeps floating-point values during training within the representable range of a given number format)
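A minimal sketch of per-tensor scaling, assuming a recent PyTorch with the `float8_e4m3fn` dtype (function names are illustrative): each tensor is rescaled so its absolute maximum lands at the top of the FP8 range before casting, and the scale is kept alongside the tensor to undo the mapping:

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value in float8_e4m3fn

def quantize_fp8_per_tensor(x: torch.Tensor):
    """Rescale so amax(x) maps to the FP8 max, then cast down.
    Returns the FP8 tensor and the scale needed to dequantize."""
    amax = x.abs().max().clamp(min=1e-12)
    scale = FP8_E4M3_MAX / amax
    return (x * scale).to(torch.float8_e4m3fn), scale

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_fp8.to(torch.float32) / scale

x = torch.randn(4096, 4096)
x_fp8, scale = quantize_fp8_per_tensor(x)
x_roundtrip = dequantize_fp8(x_fp8, scale)  # close to x, up to FP8 precision
```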
Solution: Kernels
 Writing custom kernels allows for more optimized GPU utilization; this usually focuses on reducing memory transfers, since memory bandwidth is the main bottleneck for GPU computation
 leverage hardwarespecific characteristics (e.g. Hopper TMA)
 fusing multiple operations together to limit memory transfers (e.g. adding residuals, scaling)
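A minimal Triton sketch of the fusion idea (kernel and wrapper names are illustrative): the residual add and the scaling happen in one pass over the data, so the intermediate sum never makes a round trip to global memory:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_residual_scale_kernel(x_ptr, res_ptr, out_ptr, scale, n_elements,
                                BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    res = tl.load(res_ptr + offsets, mask=mask)
    # One read of x and res, one write of the result: the (x + res)
    # intermediate lives in registers instead of global memory.
    tl.store(out_ptr + offsets, (x + res) * scale, mask=mask)

def fused_residual_scale(x: torch.Tensor, res: torch.Tensor, scale: float):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_residual_scale_kernel[grid](x, res, out, scale, n, BLOCK_SIZE=1024)
    return out
```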
Fewer FLOPs for the same capacity
 Why: Similar to above, for a fixed model capacity, reducing the required FLOPs directly translates to more tokens processed per second
Solution: Adaptive computation
 “Adaptive width”
 Mixture of Experts (MoE)
 reduces the number of FLOPs the feed-forward layers need to spend to process a given token (see the routing sketch after this list)
 validated at scale
 “Adaptive depth”
 Mixture-of-Depths paper (tokens may skip some layers)
 not validated at scale (if DeepMind published it, it’s probably because it doesn’t work at scale, at least in its current form)
 “Adaptive modality”
 MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
 tokens of a given modality are routed to that modality's dedicated experts
 The insight driving our approach is the inherent heterogeneity of modalities: text and image have distinct information and redundancy patterns
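A minimal sketch of the token-choice top-k routing that makes MoE cheaper per token (module names and the top-2 default are illustrative; MoMa's variant would further partition the experts by modality):

```python
import torch
from torch import nn

class TopKMoE(nn.Module):
    """Token-choice top-k Mixture of Experts over simple FFN experts."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Each token picks k experts, so only k of
        # the n_experts feed-forward blocks run per token.
        weights, idx = self.router(x).softmax(dim=-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel():
                out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
        return out
```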