Summary
 All of the conclusions hold only in the low-data regime, i.e. when there is not enough finetuning data.
 When data is limiting performance
 Classic/old supervised ML: when we train increasingly large neural networks from scratch on a fixed-size dataset, they eventually become data-limited and stop improving in performance (cross-entropy loss).
 Unsupervised, finetuning setting: when we do the same for models pretrained on a large language dataset, the slope of the performance gains is merely reduced rather than going to zero.
 Takeaways
 In the low data regime, the effective amount of data transferred by pretraining can be around 100x the size of the finetuning data.
 Given an estimate $D_{T}=k(D_{F})^{α}(N)^{β}$ for your particular problem,
 one can make choices w.r.t. collecting more data vs. increasing model size.
 Indeed, for example, for transfer from text to python we have $β≈2α$.
 So increasing the dataset size by a factor $C$ would be worth approximately the same as increasing the model size, $N$, by $\sqrt{C}$.
 In other words, a 10x increase in model size, $N$, would be worth approximately a 100x increase in finetuning dataset size, $D_{F}$, under these conditions.
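The takeaway above can be sketched numerically. A minimal sketch, using the text-to-python fit constants from later in these notes ($k=1.9e4$, $α=0.18$, $β=0.38$); the particular $D_{F}$ and $N$ values are illustrative only:

```python
# Effective data transferred: D_T = k * D_F**alpha * N**beta
# Constants from the text -> python fit; D_F and N below are illustrative.
k, alpha, beta = 1.9e4, 0.18, 0.38

def effective_data_transferred(D_F, N):
    return k * D_F**alpha * N**beta

D_F, N = 3e5, 4e7  # finetuning characters, model parameters (made up)
base = effective_data_transferred(D_F, N)

# Since beta ~= 2*alpha, a 10x bigger model buys roughly the same D_T
# boost as a 100x bigger finetuning dataset:
more_model = effective_data_transferred(D_F, 10 * N)
more_data = effective_data_transferred(100 * D_F, N)
print(more_model / base)  # ~10**0.38
print(more_data / base)   # ~100**0.18
```

Both ratios come out near 2.3-2.4, which is the concrete sense in which the two interventions are "worth approximately the same".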
 How to estimate
 Finetune a model with 1% and 10% of the existing dataset.
 Vary the model size given the full finetuning dataset and estimate the lost performance in terms of a reduction in model size.
 Check Appendices B and C of the paper for how to fit the power laws.
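Since $D_{T}=k(D_{F})^{α}(N)^{β}$ is linear in log space, a simple way to fit it (not necessarily the paper's exact procedure) is ordinary least squares on the logs of the measurements. The $(D_{F}, N, D_{T})$ triples below are synthetic placeholders; in practice they come from the finetuning runs described above:

```python
import numpy as np

# Fit D_T = k * D_F**alpha * N**beta via least squares in log space:
# log D_T = log k + alpha * log D_F + beta * log N
# Synthetic data generated from known constants, so the fit should recover them.
D_F = np.array([3e3, 3e4, 3e5, 3e3, 3e4, 3e5])
N   = np.array([4e6, 4e6, 4e6, 4e7, 4e7, 4e7])
D_T = 1.9e4 * D_F**0.18 * N**0.38  # placeholder targets for the demo

X = np.column_stack([np.ones_like(D_F), np.log(D_F), np.log(N)])
coef, *_ = np.linalg.lstsq(X, np.log(D_T), rcond=None)
log_k, alpha, beta = coef
print(np.exp(log_k), alpha, beta)  # recovers k=1.9e4, alpha=0.18, beta=0.38
```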
Detailed
Units of data
 The paper focuses on units of data, while holding everything else fixed.
 They calculate the effective data “transferred” from pretraining by determining how much data a transformer of the same size would have required to achieve the same loss when training from scratch.
 $D_{T}$ is the amount of additional python characters that a from-scratch model of the same size would have needed to achieve the same loss on python as a finetuned model.
 In the labeled example, we see that for a 40M parameter transformer, finetuned on 3e5 characters, $D_{T}$ is approximately 1000x bigger than $D_{F}$ .
 The less finetuning data is available, the more pretraining helps.
 In the high-data limit, the finetuned and from-scratch models perform the same, as the model can learn the distribution perfectly from the finetuning data alone.
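Operationally, computing $D_{T}$ amounts to inverting the from-scratch loss curve: find how much data the from-scratch model would need to match the finetuned loss, then subtract $D_{F}$. A minimal sketch, assuming a monotone measured loss curve (all numbers below are made up):

```python
import numpy as np

# Invert a (hypothetical, monotone) from-scratch loss curve by
# interpolating in log-data space.
scratch_data = np.array([1e4, 1e5, 1e6, 1e7, 1e8])  # python chars (made up)
scratch_loss = np.array([3.0, 2.4, 1.9, 1.5, 1.2])  # loss at each size (made up)

def data_needed_from_scratch(target_loss):
    # np.interp requires increasing x-coordinates, so reverse the arrays
    return 10 ** np.interp(target_loss, scratch_loss[::-1],
                           np.log10(scratch_data)[::-1])

D_F = 3e5             # finetuning characters actually used
finetuned_loss = 1.5  # loss reached after finetuning (made up)
D_E = data_needed_from_scratch(finetuned_loss)  # effective data
D_T = D_E - D_F                                 # data "transferred"
print(D_T / D_F)      # how many times D_F the transfer is worth
```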
Power Laws

Target: generating python code

They find that the effective data transferred is well described, in the low-data regime, by a power law of parameter count and finetuning dataset size.
 $D_{T}=k(D_{F})^{α}(N)^{β}$
 $D_{T}=2.1e5(D_{F})^{0.096}(N)^{0.38}$ (pretrained on text and other programming languages)
 $D_{T}=1.9e4(D_{F})^{0.18}(N)^{0.38}$ (pretrained on text only)
 The larger $k$ value indicates that a mixture of distributions transfers more readily than plain text in the low-data regime, while the smaller $α$ means the benefit diminishes as we approach the high-data regime.
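Evaluating both fitted laws at the same point makes this concrete: the ratio of the two $D_{T}$ values is $(2.1e5/1.9e4)\,D_{F}^{0.096-0.18}$, which starts well above 1 and shrinks as $D_{F}$ grows. A quick check (the model size $N$ is an arbitrary illustrative choice):

```python
# The two fitted laws from the notes above.
def d_t_mixed(D_F, N):   # pretrained on text + other programming languages
    return 2.1e5 * D_F**0.096 * N**0.38

def d_t_text(D_F, N):    # pretrained on text only
    return 1.9e4 * D_F**0.18 * N**0.38

N = 4e7  # illustrative model size
for D_F in (1e3, 1e5, 1e7, 1e9):
    print(D_F, d_t_mixed(D_F, N) / d_t_text(D_F, N))
# The ratio shrinks as D_F grows: the mixed-pretraining advantage
# (larger k) fades because of its smaller alpha.
```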

When comparing pretraining on text and pretraining on an equal mix of text and non-python code, they found identical scaling with model size, with exponent $β=0.38$.
 Thus the exponent beta appears to depend only on the model architecture and target distribution.
 They hypothesize that it measures how the model architecture generalizes to the target distribution.

The quantity $α$ provides a useful measure of the directed proximity of two distributions, with smaller $α$ indicating closer proximity.
 Measurements of $α$ are cheap and enable one to make principled tradeoffs between collecting expensive finetuning data and increasing model size.
 For transfer from text to python we have $β≈2α$, so increasing the dataset size by a factor $C$ would be worth approximately the same as increasing the model size, $N$, by $\sqrt{C}$. In other words, a 10x increase in model size, $N$, would be worth approximately a 100x increase in finetuning dataset size, $D_{F}$, under these conditions.
Data Multiplier
 An implication of the power law is that pretraining effectively multiplies the finetuning dataset, $D_{F}$ , in the lowdata regime. We find the multiplier formulation helpful in building intuition. Note that the multiplier goes down as $D_{F}$ increases.
 $\text{effective data multiplier}=\frac{D_{F}+D_{T}}{D_{F}}≈\frac{D_{T}}{D_{F}}=\frac{k(N)^{β}}{(D_{F})^{1−α}}$
 If $D_{F}$ is fixed, increasing $N$ by 10x effectively multiplies the data by $10^{0.38}≈2.4$.
 In the limit of very extensive pretraining, $β$ may approach 1, and increasing $N$ directly increases the effective data by the same amount.
 When data is limiting performance (i.e. $D_{F}$ fixed and small), the pretrained models have a better scaling law: they are able to improve quickly even on small amounts of data.
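The multiplier formulation above can be sketched directly. A minimal example using the text-only fit constants from these notes; $N$ and the $D_{F}$ values are illustrative:

```python
# Effective data multiplier = (D_F + D_T) / D_F ~= k * N**beta / D_F**(1 - alpha)
k, alpha, beta = 1.9e4, 0.18, 0.38  # text-only pretraining fit

def multiplier(D_F, N):
    return (D_F + k * D_F**alpha * N**beta) / D_F

N = 4e7  # illustrative model size
for D_F in (3e4, 3e5, 3e6):
    print(D_F, multiplier(D_F, N))  # multiplier shrinks as D_F grows

# When D_T dominates D_F, a 10x model scales the multiplier by ~10**0.38:
print(multiplier(3e5, 10 * N) / multiplier(3e5, N))
```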