Summary

  • ALL THE CONCLUSIONS ONLY HOLD IN THE LOW DATA REGIME, WHEN THERE’S NOT ENOUGH FINETUNING DATA
  • When data is limiting performance
    • Classic/OLD supervised ML: When we train increasingly large neural networks from-scratch on a fixed-size dataset, they eventually become data-limited and stop improving in performance (cross-entropy loss)
    • Unsupervised pre-training + fine-tuning setting: when we do the same for models pre-trained on a large language dataset, the slope of the performance gains is merely reduced rather than going to zero
  • Takeaways
    • In the low data regime, the effective amount of data transferred by pre-training can be around 100x the size of the finetuning data.
    • Given an estimate of α and β for your particular problem:
      • One can make choices about collecting more data vs. increasing model size.
        • For example, for transfer from text to python, β ≈ 2α (roughly α ≈ 0.18, β ≈ 0.38).
        • So increasing the dataset size, D_F, by a factor C would be worth approximately the same as increasing the model size, N, by √C.
        • In other words, a 10x increase in model size, N, would be worth approximately a 100x increase in fine-tuning dataset size, D_F, under these conditions.
    • How to estimate α and β (a toy sketch follows this summary)
      • Fine-tune a model with 1% and 10% of the existing dataset.
      • Vary the model size while using the full fine-tuning dataset, and estimate the lost performance in terms of a reduction in model size.
      • Check Appendix B and C of the paper for how to fit the power laws.
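
A minimal, illustrative sketch of estimating α and β from a handful of runs, assuming the effective data transferred D_T has already been measured for each run (by comparison against from-scratch training, as described under "Units of data" below). All run values here are made up for illustration, not taken from the paper.

```python
import numpy as np

def power_law_exponent(x1, y1, x2, y2):
    """Slope of log(y) vs log(x): the exponent of a power law through two points."""
    return (np.log(y2) - np.log(y1)) / (np.log(x2) - np.log(x1))

# Hypothetical measurements of effective data transferred, D_T (in characters).
# alpha: vary the fine-tuning dataset D_F (e.g. 1% vs 10% of the data) at fixed model size.
alpha = power_law_exponent(x1=3e5, y1=2.5e8, x2=3e6, y2=3.8e8)

# beta: vary the model size N while fine-tuning on the full dataset.
beta = power_law_exponent(x1=40e6, y1=3.8e8, x2=400e6, y2=9.1e8)

print(f"alpha ~ {alpha:.2f}, beta ~ {beta:.2f}, beta/alpha ~ {beta / alpha:.1f}")
```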

Detailed

Units of data

  • The paper focuses on units of data, while holding everything else fixed.
  • They calculate the effective data “transferred” from pre-training by determining how much data a transformer of the same size would have required to achieve the same loss when training from scratch.
    • D_T is the amount of additional python characters that a from-scratch model of the same size would have needed to achieve the same loss on python as the fine-tuned model (a small illustrative calculation follows this list).
    • In the paper's labeled example, for a 40M parameter transformer fine-tuned on 3e5 characters, D_T is approximately 1000x bigger than D_F.
    • The less fine-tuning data is available, the more pre-training helps.
      • In the limit of abundant fine-tuning data, the two converge, as the model will learn the distribution perfectly from the fine-tuning data alone
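
A toy version of this bookkeeping, assuming the from-scratch data scaling law for a model of this size is known. The loss-curve form L(D) = (D_c / D)^p, its constants, and the fine-tuned loss below are all placeholders chosen only so the output roughly echoes the ~1000x example above; they are not the paper's fits.

```python
# Illustrative from-scratch data scaling law for this model size on python:
#   L(D) = (D_c / D) ** p
D_C = 5e10   # assumed scale constant (characters)
P = 0.18     # assumed data exponent

def effective_data(loss: float) -> float:
    """Invert the scaling law: python characters a from-scratch model of the
    same size would need in order to reach `loss`."""
    return D_C / loss ** (1 / P)

D_F = 3e5                              # python characters actually used for fine-tuning
finetuned_loss = 2.5                   # measured loss of the fine-tuned model (made up)
D_E = effective_data(finetuned_loss)   # total effective python data
D_T = D_E - D_F                        # data "transferred" from pre-training

print(f"D_T ~ {D_T:.1e} chars, roughly {D_T / D_F:.0f}x the fine-tuning data")
```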

Power Laws

  • Target: generating python code

  • They find that the effective data transferred is described well in the low data regime by a power-law of parameter count and fine-tuning dataset size.

    • D_T ≈ k · (D_F)^α · (N)^β
    • Pretrained on text and other programming languages: larger k, smaller α (α ≈ 0.096, β ≈ 0.38)
    • Pretrained on text only: α ≈ 0.18, β ≈ 0.38
    • The larger k value indicates that the mixture of distributions transfers more readily than plain text in the low-data regime, while the smaller α means the benefit diminishes as we approach the high-data regime.
  • When comparing pre-training on text and pre-training on an equal mix of text and non-python code, they found identical scaling with model size: the exponent β ≈ 0.38 in both cases.

    • Thus the exponent β appears to depend only on the model architecture and the target distribution.
    • They hypothesize that it measures how the model architecture generalizes on the target distribution.
  • The quantity α provides a useful measure of the directed proximity of two distributions, with smaller α indicating closer proximity.

    • Measurements of α are cheap and enable one to make principled trade-offs between collecting expensive fine-tuning data and increasing model size.
    • For transfer from text to python we have β ≈ 2α, so increasing the dataset size, D_F, by a factor C would be worth approximately the same as increasing the model size, N, by √C. In other words, a 10x increase in model size, N, would be worth approximately a 100x increase in fine-tuning dataset size, D_F, under these conditions (see the sketch below).
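
A hedged sketch of the power law and the data-vs-model trade-off it implies. The constants are ballpark values for the text-to-python fit (α ≈ 0.18, β ≈ 0.38, with k taken on the order of 1e4); treat them as illustrative assumptions rather than the paper's exact numbers.

```python
# Ballpark text -> python constants (assumed for illustration).
K, ALPHA, BETA = 1.9e4, 0.18, 0.38

def effective_data_transferred(d_f: float, n_params: float) -> float:
    """D_T = k * D_F**alpha * N**beta, valid in the low-data regime."""
    return K * d_f ** ALPHA * n_params ** BETA

def equivalent_model_scale(data_scale: float) -> float:
    """Model-size factor that buys the same D_T gain as scaling D_F by `data_scale`."""
    return data_scale ** (ALPHA / BETA)

# 40M-parameter model fine-tuned on 3e5 python characters:
print(effective_data_transferred(d_f=3e5, n_params=40e6))  # several hundred times D_F
print(equivalent_model_scale(100))                          # ~9x, i.e. ~10x model ~ 100x data
```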

Data Multiplier

  • An implication of the power law is that pre-training effectively multiplies the fine-tuning dataset, D_F, in the low-data regime. The authors find the multiplier formulation helpful for building intuition. Note that the multiplier goes down as D_F increases (see the sketch at the end of this section).
  • If D_F is fixed, increasing N by 10x multiplies the effective data by roughly 10^β ≈ 2.4x.
  • In the high-data limit the multiplier approaches 1, and increasing D_F directly increases the effective data by the same amount.
  • When data is limiting performance, i.e. D_F is fixed and small, the pre-trained models have a better scaling law, i.e. they are able to improve quickly even on small amounts of data
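
A short sketch of the multiplier view, reusing the same assumed ballpark constants for text-to-python as above.

```python
K, ALPHA, BETA = 1.9e4, 0.18, 0.38   # assumed ballpark fit for text -> python

def data_multiplier(d_f: float, n_params: float) -> float:
    """Effective data multiplier: (D_F + D_T) / D_F = 1 + k * D_F**(alpha - 1) * N**beta."""
    return 1.0 + K * d_f ** (ALPHA - 1.0) * n_params ** BETA

n = 40e6
for d_f in (3e5, 3e6, 3e7):
    print(f"D_F={d_f:.0e}: multiplier ~ {data_multiplier(d_f, n):.0f}")  # shrinks as D_F grows

# Fixing D_F and scaling N by 10x boosts the effective data by ~10**BETA ~ 2.4x (low-data regime).
print(data_multiplier(3e5, 10 * n) / data_multiplier(3e5, n))
```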