Summary
Short
- Larger models are significantly more sample-efficient
    - such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence
- Performance depends strongly on scale, weakly on model shape
- Universality of training: Training curves follow predictable power laws, whose parameters are roughly independent of model size
    - By extrapolating the early part of the curve, we can predict the loss if trained much longer
Detailed
- Performance depends strongly on scale, weakly on model shape: Model performance depends most strongly on scale, which consists of three factors: the number of model parameters N (excluding embeddings), the size of the dataset D, and the amount of compute C used for training. Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width.
- Smooth power laws: Performance has a power-law relationship (a straight line on a log-log plot) with each of the three scale factors N, D, and C, when not bottlenecked by the other two (see the sketch below)
    - N low ⇒ capacity too low
    - D low ⇒ not enough data to generalize
    - C low ⇒ underfitting
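A minimal sketch of the three laws in code. The functional form L(X) = (X_c / X)^α_X is the paper's; the constants and exponents below are its approximate fits, quoted roughly and only for illustration:

```python
# Approximate scaling-law fits from Kaplan et al. (2020), each valid when the
# other two factors are not the bottleneck. Constants are rough, illustrative values.
def loss_vs_params(n_params):      # L(N): non-embedding parameters
    return (8.8e13 / n_params) ** 0.076

def loss_vs_data(n_tokens):        # L(D): dataset size in tokens
    return (5.4e13 / n_tokens) ** 0.095

def loss_vs_compute(pf_days):      # L(C): training compute in PF-days, compute-optimal allocation
    return (3.1e8 / pf_days) ** 0.050

print(loss_vs_params(1e9))         # ~2.4 nats/token for a 1B-parameter model
```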
- Universality of overfitting: Performance improves predictably as long as we scale up N and D in tandem.
    - diminishing returns if N or D is fixed while the other increases ⇒ good argument for diff-dataset
    - The performance penalty depends predictably on the ratio N^0.74 / D
    - This means, for example, that if we increase model size 8x, we only need to increase the data by roughly 5x to avoid a penalty.
    - ⇒ the numerator N^0.74 increases by 8^0.74 ≈ 5, so D must grow by the same factor to keep the ratio fixed (checked below)
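Quick check of the 8x ⇒ ~5x claim, using the paper's fitted exponent 0.74:

```python
# If N grows 8x, D must grow by the same factor as N**0.74 to keep N**0.74 / D constant.
print(8 ** 0.74)   # ≈ 4.66, i.e. roughly 5x more data
```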
- Universality of training: Training curves follow predictable power laws, whose parameters are roughly independent of model size
    - By extrapolating the early part of the curve, we can predict the loss if trained much longer (see the fitting sketch below)
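As an illustration of that extrapolation, a minimal curve-fitting sketch on synthetic numbers (not the paper's data); the curve shape L(S) = L_inf + (S_c / S)^alpha is an assumption in the spirit of the paper's parameterization:

```python
import numpy as np
from scipy.optimize import curve_fit

def training_curve(steps, l_inf, s_c, alpha):
    # Assumed form: irreducible loss plus a power-law term in training steps.
    return l_inf + (s_c / steps) ** alpha

# Synthetic "early training" measurements, for illustration only.
rng = np.random.default_rng(0)
steps = np.array([1e3, 2e3, 5e3, 1e4, 2e4, 5e4])
loss = training_curve(steps, 2.0, 4e4, 0.35) + rng.normal(0, 0.003, steps.size)

# Fit on the early part of the curve, then extrapolate far beyond it.
params, _ = curve_fit(training_curve, steps, loss, p0=[1.5, 1e4, 0.5])
print(training_curve(1e6, *params))   # predicted loss after 1M steps
```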
- OOD transfer improves with IID test performance: transfer to a different distribution incurs a roughly constant offset in the loss, but otherwise improves roughly in line with performance on the training set.
- Sample efficiency: Larger models are more sample-efficient, i.e. they reach the same level of performance with fewer optimization steps / tokens processed.
    - However, this may not be training-compute optimal!
    - Indeed, take two models where the larger has k times more parameters (e.g. a 1M model and a 1B model, k = 1000): even if the large model reaches the same loss with s times fewer samples, it is still (approximately) k/s times more compute-intensive to fit, since compute per token scales with model size (see the sketch below).
    - Caveat: the smaller model may never achieve the loss target we have in mind, so compute optimality is loss-target dependent.
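A back-of-the-envelope version of that argument, using the common C ≈ 6·N·D approximation for training FLOPs; the 1000x size ratio and the assumed 10x sample-efficiency gain are illustrative numbers, not figures from the paper:

```python
def train_flops(n_params, n_tokens):
    # Rough rule of thumb: ~6 FLOPs per parameter per token (forward + backward).
    return 6 * n_params * n_tokens

small = train_flops(1e6, 1e10)       # 1M-param model, hypothetical 10B tokens to hit the target loss
large = train_flops(1e9, 1e10 / 10)  # 1B-param model, assumed to need 10x fewer tokens (s = 10)

print(large / small)                 # 100.0, i.e. k / s = 1000 / 10
```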
- Training until convergence is not compute-efficient: Given a fixed compute budget C, we attain optimal performance by training very large models and stopping significantly short of convergence
    - Note that this may not be in contradiction with Chinchilla
    - This is consistent with the current “single-epoch” framework, where D is very large.
    - Data requirements grow very slowly w.r.t. training compute, i.e. D ∝ C^0.27
    - 10x the compute ⇒ only ~2x the data (see the sketch below)
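A sketch of how the compute-efficient allocation scales, using the paper's approximate exponents (optimal model size N ∝ C^0.73, data processed D ∝ C^0.27; quoted roughly):

```python
# Growth of the optimal model size and of the data processed when the compute
# budget is multiplied by `c` (exponents are the paper's approximate fits).
def compute_efficient_growth(c, param_exp=0.73, data_exp=0.27):
    return c ** param_exp, c ** data_exp

params_x, data_x = compute_efficient_growth(10)
print(params_x, data_x)   # ~5.4x larger model, only ~1.9x more data
```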