Summary

Short

  • Larger models are significantly more sample-efficient

    • such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence
  • Performance depends strongly on scale, weakly on model shape

  • Universality of training: Training curves follow predictable power laws, whose parameters are roughly independent of model size

    • By extrapolating the early part of the curve, we can predict the loss if trained much longer

Detailed

  • Performance depends strongly on scale, weakly on model shape: Model performance depends most strongly on scale, which consists of three factors: the number of model parameters $N$ (excluding embeddings), the size of the dataset $D$, and the amount of compute $C$ used for training. Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width

  • Smooth power laws: Performance has a power-law relationship (i.e. linear on a log-log plot) with each of the three scale factors $N$, $D$, $C$, when not bottlenecked by the other two (see the fitting sketch after this list)

    • low $N$: capacity too low
    • low $D$: not enough data to generalize
    • low $C$: underfitting
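
As a concrete illustration of what fitting one of these power laws looks like, here is a minimal sketch using the $L(N) = (N_c/N)^{\alpha_N}$ form; the (N, loss) pairs below are made up purely for illustration and are not measurements from the paper.

```python
import numpy as np

# Hypothetical (non-embedding parameter count N, converged test loss) pairs.
# These numbers are made up purely to illustrate the fitting procedure.
N = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
loss = np.array([5.1, 4.2, 3.5, 2.9, 2.4])

# A power law L(N) = (N_c / N)**alpha_N is a straight line in log-log space:
#   log L = alpha_N * log N_c - alpha_N * log N
slope, intercept = np.polyfit(np.log(N), np.log(loss), deg=1)
alpha_N = -slope
N_c = np.exp(intercept / alpha_N)
print(f"alpha_N ~ {alpha_N:.3f}, N_c ~ {N_c:.2e}")

# Extrapolate to a model 10x larger than the biggest one we "measured".
print(f"extrapolated loss at N = 1e11: {(N_c / 1e11) ** alpha_N:.2f}")
```
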
  • Universality of overfitting: Performance improves predictably as long as we scale up $N$ and $D$ in tandem.

    • diminishing returns if $N$ or $D$ is fixed while the other increases (good argument for diff-dataset)
    • The performance penalty depends predictably on the ratio $N^{0.74}/D$
    • This means, for example, that if we increase model size 8x, we only need to increase the data by roughly 5x to avoid a penalty.
      • $N^{0.74}$ increases by $8^{0.74} \approx 4.7$, so $D$ must grow by roughly the same factor to keep the ratio constant (worked example below)
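
A quick back-of-the-envelope check of the 8x → 5x claim, assuming the penalty is governed by the $N^{0.74}/D$ ratio:

```python
# Keeping the overfitting penalty fixed means keeping N**0.74 / D fixed,
# so D must grow by the same factor as N**0.74.
model_scale = 8                      # parameters scaled up 8x
data_scale = model_scale ** 0.74     # required dataset scaling
print(f"data must grow by ~{data_scale:.1f}x")   # ~4.7x, i.e. roughly 5x
```
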
  • Universality of training: Training curves follow predictable power laws, whose parameters are roughly independent of model size; by extrapolating the early part of the curve, we can predict the loss if trained much longer
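
A minimal sketch of that extrapolation idea, assuming a power-law-plus-offset form $L(S) = L_\infty + (S_c/S)^{\alpha_S}$ and a synthetic early training curve generated from that same form (the numbers are illustrative, not from the paper):

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_curve(s, l_inf, s_c, alpha):
    # Power-law-plus-offset form: L(S) = L_inf + (S_c / S)**alpha
    return l_inf + (s_c / s) ** alpha

# Synthetic "early training" measurements (steps 1k-50k), generated from the
# same functional form plus a little noise -- illustrative only.
rng = np.random.default_rng(0)
steps = np.array([1e3, 2e3, 5e3, 1e4, 2e4, 5e4])
observed = loss_curve(steps, l_inf=2.0, s_c=5e4, alpha=0.3)
observed = observed + rng.normal(0.0, 0.01, steps.size)

# Fit only this early portion, then extrapolate far beyond it.
params, _ = curve_fit(loss_curve, steps, observed,
                      p0=[1.0, 1e4, 0.5],
                      bounds=([0.0, 1.0, 0.01], [10.0, 1e7, 2.0]))
l_inf, s_c, alpha = params
print(f"fitted L_inf={l_inf:.2f}, S_c={s_c:.0f}, alpha={alpha:.2f}")
print(f"predicted loss at step 1e6: {loss_curve(1e6, *params):.3f}")
print(f"'true'    loss at step 1e6: {loss_curve(1e6, 2.0, 5e4, 0.3):.3f}")
```
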

  • OOD transfer improves with IID test performance: transfer to a different distribution incurs a roughly constant offset in the loss, but otherwise improves roughly in line with performance on the training set.

  • Sample efficiency: Larger models are more sample-efficient, i.e. they reach the same level of performance with fewer optimization steps/tokens processed.

    • However, this may not be training-compute optimal!
      • Indeed, given two models $M_1$ and $M_2$ with $N_2 = 1000 \cdot N_1$, e.g. a 1M model and a 1B model, let's say $M_2$ achieves the same loss as $M_1$ in $k$ times fewer samples; since per-token compute scales with $N$, it's still (approximately) $1000/k$ times more compute intensive to fit $M_2$ (see the sketch below)
      • Caveat: the smaller model may never achieve the loss target we have in mind, so compute optimality is loss-target dependent
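
To make the compute comparison concrete, a rough sketch using the standard $C \approx 6 \cdot N \cdot (\text{tokens})$ estimate of training compute; the token counts are hypothetical, chosen so the larger model reaches the target loss with $k = 10$ times fewer tokens.

```python
# Rough training-compute comparison using C ~ 6 * N * (tokens processed),
# i.e. forward + backward cost per token scales with parameter count N.
def train_flops(n_params: float, tokens: float) -> float:
    return 6 * n_params * tokens

# Hypothetical scenario matching the bullet above: the 1B model reaches the
# same loss as the 1M model with k = 10 times fewer tokens...
small = train_flops(1e6, tokens=1e11)   # 1M params, 100B tokens
large = train_flops(1e9, tokens=1e10)   # 1B params, 10B tokens
print(f"compute ratio (large / small): {large / small:.0f}x")  # ~1000/k = 100x
```
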
  • Training until convergence is not efficient for compute optimality: Given a fixed compute budget $C$, we attain optimal performance by training very large models and stopping significantly short of convergence

    • Note that this is not necessarily in contradiction with Chinchilla
    • This is consistent with the current “single-epoch” framework, when $D$ is very large.
    • Data requirements grow very slowly w.r.t. training compute, i.e. $D \propto C^{0.27}$
      • 10x the compute → only ~2x the data (since $10^{0.27} \approx 1.9$; see the snippet below)
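
A tiny snippet spelling out what $D \propto C^{0.27}$ implies for a few compute multipliers:

```python
# What D ∝ C**0.27 implies for a few compute multipliers.
for compute_factor in (10, 100, 1000):
    print(f"{compute_factor:>5}x compute -> ~{compute_factor ** 0.27:.1f}x data")
# 10x -> ~1.9x, 100x -> ~3.5x, 1000x -> ~6.5x
```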