Summary

  • ALL THE CONCLUSIONS ONLY HOLD IN THE LOW DATA REGIME, WHEN THERE’S NOT ENOUGH FINETUNING DATA
  • When data is limiting performance
    • Classic/OLD supervised ML: When we train increasingly large neural networks from-scratch on a fixed-size dataset, they eventually become data-limited and stop improving in performance (cross-entropy loss)
    • Unsupervised pre-training + fine-tuning setting: when we do the same for models pre-trained on a large language dataset, the slope of the performance gains is merely reduced rather than going to zero
  • Takeaways
    • In the low data regime, the effective amount of data transferred by pre-training can be around 100x the size of the finetuning data.
    • Given an estimate of α and β for your particular problem:
      • One can make choices about collecting more data vs. increasing model size.
        • For example, for transfer from text to python, β ≈ 2α (roughly α ≈ 0.18, β ≈ 0.38).
        • So increasing the dataset size, D_F, by a factor C would be worth approximately the same as increasing the model size, N, by √C.
        • In other words, a 10x increase in model size, N, would be worth approximately a 100x increase in fine-tuning dataset size, D_F, under these conditions.
    • How to estimate α and β (a toy sketch follows this summary)
      • Fine-tune a model with 1% and 10% of the existing dataset.
      • Vary the model size while using the full fine-tuning dataset, and estimate the lost performance in terms of a reduction in model size.
      • Check Appendix B and C of the paper for how to fit the power laws.
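
A minimal, illustrative sketch of estimating α and β from a handful of runs, assuming the effective data transferred D_T has already been measured for each run (by comparison against from-scratch training, as described under "Units of data" below). All run values here are made up for illustration, not taken from the paper.

```python
import numpy as np

def power_law_exponent(x1, y1, x2, y2):
    """Slope of log(y) vs log(x): the exponent of a power law through two points."""
    return (np.log(y2) - np.log(y1)) / (np.log(x2) - np.log(x1))

# Hypothetical measurements of effective data transferred, D_T (in characters).
# alpha: vary the fine-tuning dataset D_F (e.g. 1% vs 10% of the data) at fixed model size.
alpha = power_law_exponent(x1=3e5, y1=2.5e8, x2=3e6, y2=3.8e8)

# beta: vary the model size N while fine-tuning on the full dataset.
beta = power_law_exponent(x1=40e6, y1=3.8e8, x2=400e6, y2=9.1e8)

print(f"alpha ~ {alpha:.2f}, beta ~ {beta:.2f}, beta/alpha ~ {beta / alpha:.1f}")
```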

Detailed

Units of data

  • The paper focuses on units of data, while holding everything else fixed.
  • They calculate the effective data “transferred” from pre-training by determining how much data a transformer of the same size would have required to achieve the same loss when training from scratch.
    • D_T is the amount of additional python characters that a from-scratch model of the same size would have needed to achieve the same loss on python as the fine-tuned model (a small illustrative calculation follows this list).
    • In the paper's labeled example, for a 40M parameter transformer fine-tuned on 3e5 characters, D_T is approximately 1000x bigger than D_F.
    • The less fine-tuning data is available, the more pre-training helps.
      • In the limit of abundant fine-tuning data, the two converge, as the model will learn the distribution perfectly from the fine-tuning data alone
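
A toy version of this bookkeeping, assuming the from-scratch data scaling law for a model of this size is known. The loss-curve form L(D) = (D_c / D)^p, its constants, and the fine-tuned loss below are all placeholders chosen only so the output roughly echoes the ~1000x example above; they are not the paper's fits.

```python
# Illustrative from-scratch data scaling law for this model size on python:
#   L(D) = (D_c / D) ** p
D_C = 5e10   # assumed scale constant (characters)
P = 0.18     # assumed data exponent

def effective_data(loss: float) -> float:
    """Invert the scaling law: python characters a from-scratch model of the
    same size would need in order to reach `loss`."""
    return D_C / loss ** (1 / P)

D_F = 3e5                              # python characters actually used for fine-tuning
finetuned_loss = 2.5                   # measured loss of the fine-tuned model (made up)
D_E = effective_data(finetuned_loss)   # total effective python data
D_T = D_E - D_F                        # data "transferred" from pre-training

print(f"D_T ~ {D_T:.1e} chars, roughly {D_T / D_F:.0f}x the fine-tuning data")
```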

Power Laws

  • Target: generating python code

  • They find that the effective data transferred is described well in the low data regime by a power-law of parameter count and fine-tuning dataset size.

    • D_T ≈ k · (D_F)^α · (N)^β
    • Pretrained on text and other programming languages: larger k, smaller α (α ≈ 0.096, β ≈ 0.38)
    • Pretrained on text only: α ≈ 0.18, β ≈ 0.38
    • The larger k value indicates that the mixture of distributions transfers more readily than plain text in the low-data regime, while the smaller α means the benefit diminishes as we approach the high-data regime.
  • When comparing pre-training on text and pre-training on an equal mix of text and non-python code, they found identical scaling with model size: the exponent β ≈ 0.38 in both cases.

    • Thus the exponent β appears to depend only on the model architecture and the target distribution.
    • They hypothesize that it measures how the model architecture generalizes on the target distribution.
  • The quantity α provides a useful measure of the directed proximity of two distributions, with smaller α indicating closer proximity.

    • Measurements of α are cheap and enable one to make principled trade-offs between collecting expensive fine-tuning data and increasing model size.
    • For transfer from text to python we have β ≈ 2α, so increasing the dataset size, D_F, by a factor C would be worth approximately the same as increasing the model size, N, by √C. In other words, a 10x increase in model size, N, would be worth approximately a 100x increase in fine-tuning dataset size, D_F, under these conditions (see the sketch below).
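
A hedged sketch of the power law and the data-vs-model trade-off it implies. The constants are ballpark values for the text-to-python fit (α ≈ 0.18, β ≈ 0.38, with k taken on the order of 1e4); treat them as illustrative assumptions rather than the paper's exact numbers.

```python
# Ballpark text -> python constants (assumed for illustration).
K, ALPHA, BETA = 1.9e4, 0.18, 0.38

def effective_data_transferred(d_f: float, n_params: float) -> float:
    """D_T = k * D_F**alpha * N**beta, valid in the low-data regime."""
    return K * d_f ** ALPHA * n_params ** BETA

def equivalent_model_scale(data_scale: float) -> float:
    """Model-size factor that buys the same D_T gain as scaling D_F by `data_scale`."""
    return data_scale ** (ALPHA / BETA)

# 40M-parameter model fine-tuned on 3e5 python characters:
print(effective_data_transferred(d_f=3e5, n_params=40e6))  # several hundred times D_F
print(equivalent_model_scale(100))                          # ~9x, i.e. ~10x model ~ 100x data
```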

Data Multiplier

  • An implication of the power law is that pre-training effectively multiplies the fine-tuning dataset, D_F, in the low-data regime. The authors find the multiplier formulation helpful for building intuition. Note that the multiplier goes down as D_F increases (see the sketch at the end of this section).
  • If D_F is fixed, increasing N by 10x multiplies the effective data by roughly 10^β ≈ 2.4x.
  • In the high-data limit the multiplier approaches 1, and increasing D_F directly increases the effective data by the same amount.
  • When data is limiting performance, i.e. D_F is fixed and small, the pre-trained models have a better scaling law, i.e. they are able to improve quickly even on small amounts of data
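
A short sketch of the multiplier view, reusing the same assumed ballpark constants for text-to-python as above.

```python
K, ALPHA, BETA = 1.9e4, 0.18, 0.38   # assumed ballpark fit for text -> python

def data_multiplier(d_f: float, n_params: float) -> float:
    """Effective data multiplier: (D_F + D_T) / D_F = 1 + k * D_F**(alpha - 1) * N**beta."""
    return 1.0 + K * d_f ** (ALPHA - 1.0) * n_params ** BETA

n = 40e6
for d_f in (3e5, 3e6, 3e7):
    print(f"D_F={d_f:.0e}: multiplier ~ {data_multiplier(d_f, n):.0f}")  # shrinks as D_F grows

# Fixing D_F and scaling N by 10x boosts the effective data by ~10**BETA ~ 2.4x (low-data regime).
print(data_multiplier(3e5, 10 * n) / data_multiplier(3e5, n))
```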