“An Empirical Model of Large-Batch Training”
Takeaways

The gradient noise scale (essentially a measure of the signal-to-noise ratio of the gradient across training examples) predicts the largest useful batch size across many domains and applications

Best tradeoff for efficiency is to select $B = B_{noise}$

Noise scale growth over training: $B_{noise}$ will grow when the gradient decreases in magnitude, as long as the noise $tr(\Sigma)$ stays roughly constant. Since $|G|$ decreases as we approach the minimum of a smooth loss, we would expect $B_{noise}$ to increase during neural network training.
 The optimal learning rate initially scales linearly as we increase the batch size. For Adam and RMSProp, the optimal learning rate initially obeys a power law $\epsilon(B) \sim B^{\alpha}$ with $\alpha$ between $0.5$ and $1.0$ depending on the task.
 Early in training, smaller batches are sufficient to make optimal progress, while larger batches are required later in training.
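A quick numeric sketch of the step-size takeaway (the values of $\epsilon_{max}$ and $B_{noise}$ below are made up, not from the paper): with the paper's $\epsilon_{opt}(B) = \epsilon_{max}/(1 + B_{noise}/B)$, the optimal SGD step size grows roughly linearly while $B \ll B_{noise}$ and saturates at $\epsilon_{max}$ for large $B$:

```python
# Hypothetical values, chosen only for illustration.
eps_max, B_noise = 0.4, 1000.0

def eps_opt(B):
    """Optimal SGD step size as a function of batch size (paper's formula)."""
    return eps_max / (1.0 + B_noise / B)

for B in [1, 2, 4, 1000, 100_000]:
    print(f"B={B:>6}: eps_opt = {eps_opt(B):.5f}")
```

Doubling $B$ from 1 to 2 roughly doubles the optimal step size, while by $B = 100{,}000$ it is already close to $\epsilon_{max}$.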

Why it matters:
 A major enabler in the growth of DL has been parallelism – the extent to which a training process can be usefully spread across multiple devices. Regardless of how much total computation is available, if model training cannot be sufficiently parallelized, then it may take too much serial time and therefore may be practically infeasible.
 Common source of parallelism: data parallelism
 Large batch sizes can achieve almost linear speedups in training without substantially harming sample efficiency or generalization.

Why is there an optimal batch size?
 Set up: SGD optimization
 If batch size very small
 The gradient approximation will have very high variance, and the update will be mostly noise
 Over many updates, the noise will wash out
 However, this requires many sequential steps
 We can get a linear gain in efficiency by parallelizing, i.e. increasing the batch size: aggregating those small updates and applying them all at once (a larger batch size with a proportionally larger learning rate) gives an equivalent result in fewer sequential steps
 If batch size very big
 The gradient approximation is equal to the true gradient
 However, reducing the batch size by two will likely give us the same approximation
 Thus, we’re using twice as much computation for little gain
 The optimal batch size gives us a good (not too noisy) gradient approximation, while being compute efficient.
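The small-batch half of the argument is easy to see numerically. A toy sketch (all numbers hypothetical): per-example gradients are the true gradient plus zero-mean noise, so averaging a batch of $B$ of them shrinks the noise variance by a factor of $1/B$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true gradient and per-example noise level.
true_grad = np.array([1.0, -2.0, 0.5])
noise_std = 5.0

def grad_estimate(B):
    """Mini-batch gradient: mean of B noisy per-example gradients."""
    noise = rng.normal(0.0, noise_std, size=(B, true_grad.size))
    return (true_grad + noise).mean(axis=0)

for B in [1, 10, 100, 10_000]:
    errors = [np.linalg.norm(grad_estimate(B) - true_grad) for _ in range(200)]
    print(f"B={B:>6}: mean |G_est - G| = {np.mean(errors):.3f}")
```

At $B = 1$ the estimate is mostly noise; at $B = 10{,}000$ the error is tiny, and halving the batch would barely change it — the diminishing-returns regime described above.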

A smaller batch size may also help escape local minima, as the noise in the estimated gradient can push us out of them
Definition of gradient noise scale
Summary
 Gradient noise scale $B_{noise} \approx \frac{tr(\Sigma)}{|G|^{2}}$ = the sum of the variances of the individual gradient components, divided by the squared global norm of the gradient
 The optimal improvement of the loss from the noisy gradient update is a function of $B_{noise}$ and $B$, i.e. $\Delta L_{opt}(B) = \frac{\Delta L_{max}}{1 + B_{noise}/B}$
 In this way, we can compute when we get diminishing returns from increasing the batch size; the conclusion is that the batch size should roughly equal the noise scale
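The diminishing returns are easy to see numerically. A minimal sketch with a made-up $B_{noise}$: at $B = B_{noise}$ we already get half of the maximum per-step improvement, and doubling the batch size again only brings us to two thirds.

```python
# Hypothetical noise scale; delta_L_max is normalized to 1.
B_noise = 1000.0
delta_L_max = 1.0

def delta_L_opt(B):
    """Optimal per-step loss improvement at batch size B (paper's formula)."""
    return delta_L_max / (1.0 + B_noise / B)

for B in [100, 1000, 2000, 10_000]:
    print(f"B={B:>6}: fraction of max improvement = {delta_L_opt(B):.3f}")
```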
Detailed

Consider a model, parameterized by variables $θ$, whose performance is assessed by a loss function $L(θ)$. The loss function is given by an average over a distribution $ρ(x)$ over data points $x$. Each data point $x$ has an associated loss function $L_{x}(θ)$, and the full loss is given by $L(θ)=E_{x∼ρ}[L_{x}(θ)]$

The true gradient is $G(θ)=∇L(θ)$

Estimated gradient: $G_{est}(\theta) = \frac{1}{B} \sum_{i=1}^{B} \nabla_{\theta} L_{x_{i}}(\theta); \quad x_{i} \sim \rho$

$G_{est}(θ)$ is a random variable such that
 $E[G_{est}(θ)]$ = $G(θ)$
 $cov(G_{est}(\theta)) = \frac{1}{B} \Sigma(\theta)$
 where the per-example covariance matrix of the gradient is $\Sigma(\theta) = cov_{x\sim\rho}(\nabla_{\theta}L_{x}(\theta)) = E_{x\sim\rho}\left[(\nabla_{\theta}L_{x}(\theta))(\nabla_{\theta}L_{x}(\theta))^{T}\right] - G(\theta)G(\theta)^{T}$

We are interested in how useful the gradient is for optimization purposes as a function of $B$, and how that might guide us in choosing a good $B$.

We can do this by connecting the noise in the gradient to the maximum improvement in true loss that we can expect from a single gradient update.

Letting $G$ and $H$ be the true gradient and hessian at parameters $θ$
 We can perturb the parameters $θ$ by a small vector $V$ to $θ−ϵV$, where $ϵ$ is the step size
 We do a quadratic Taylor expansion of the loss around this perturbation
 $L(\theta - \epsilon V) \approx L(\theta) - \epsilon G^{T}V + \frac{1}{2}\epsilon^{2} V^{T} H V$
 Now if we replace $V$ by $G_{est}$, we get on expectation
 $E[L(\theta - \epsilon G_{est})] = L(\theta) - \epsilon |G|^{2} + \frac{1}{2}\epsilon^{2}\left(G^{T}HG + \frac{tr(H\Sigma)}{B}\right)$
 for the trace term, use that for a random vector $V$, $E[V^{T}HV] = E[V]^{T} H\, E[V] + tr(H \, cov(V))$, together with $cov(G_{est}) = \frac{1}{B}\Sigma$
 Let’s minimize this equation w.r.t $ϵ$
 $\frac{d}{d\epsilon} E[L(\theta - \epsilon G_{est})] = -|G|^{2} + \epsilon\left(G^{T}HG + \frac{tr(H\Sigma)}{B}\right)$
 $\epsilon_{opt}(B) = \frac{|G|^{2}}{G^{T}HG + \frac{tr(H\Sigma)}{B}} = \frac{\epsilon_{max}}{1 + B_{noise}/B}$
 where $\epsilon_{max} = \frac{|G|^{2}}{G^{T}HG}$, the optimal step size given the true gradient $G$
 and $B_{noise} = \frac{tr(H\Sigma)}{G^{T}HG}$ is the gradient noise scale
 Big assumption: if we assume the optimization is perfectly well-conditioned, then the Hessian $H$ is a multiple of the identity matrix $I$.
 then, we can compute a simple estimate of the gradient noise scale $B_{simple} = \frac{tr(\Sigma)}{|G|^{2}}$, which is the sum of the variances of the individual gradient components divided by the squared norm of the gradient.
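Given per-example gradients, $B_{simple}$ is one line of arithmetic. A sketch on synthetic data (the gradient matrix here is made up; in practice it would come from per-example backprop at the current $\theta$):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-example gradients: 10k examples, 4 parameters,
# small mean (signal) relative to per-example variance (noise).
grads = rng.normal(loc=0.1, scale=1.0, size=(10_000, 4))

G = grads.mean(axis=0)                      # estimate of the true gradient
tr_Sigma = grads.var(axis=0, ddof=1).sum()  # sum of per-component variances
B_simple = tr_Sigma / (G @ G)               # tr(Σ) / |G|²
print(f"B_simple ≈ {B_simple:.0f}")
```

With these numbers $tr(\Sigma) \approx 4$ and $|G|^{2} \approx 0.04$, so the noise scale comes out near 100: batches much larger than that would mostly average already-clean gradients.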
How to compute the gradient noise scale in practice
 There’s a method for measuring the noise scale that comes essentially for free in a data-parallel training environment
 We estimate the noise scale by comparing the norm of the gradient for different batch sizes
Derivation
 The expected squared gradient norm for a batch size $B$ is $E[|G_{est}|^{2}] = |G|^{2} + \frac{1}{B} tr(\Sigma)$
 Given estimates of $|G_{est}|^{2}$ for both $B = B_{small}$ and $B = B_{big}$, we can obtain unbiased estimates $|\mathcal{G}|^{2}$ and $\mathcal{S}$ of $|G|^{2}$ and $tr(\Sigma)$, respectively: $|\mathcal{G}|^{2} = \frac{1}{B_{big} - B_{small}}\left(B_{big}|G_{B_{big}}|^{2} - B_{small}|G_{B_{small}}|^{2}\right)$ and $\mathcal{S} = \frac{1}{1/B_{small} - 1/B_{big}}\left(|G_{B_{small}}|^{2} - |G_{B_{big}}|^{2}\right)$
 We can compute $|G_{B_{small}}|^{2}$ and $|G_{B_{big}}|^{2}$ for free in a data-parallel setup by computing the norm of the gradient before and after averaging between devices :)))
 Note that $\mathcal{S}/|\mathcal{G}|^{2}$ is not an unbiased estimator of $B_{simple} = \frac{tr(\Sigma)}{|G|^{2}}$
 Thus, we keep an EMA of both values such that the ratio of the moving averages is a good estimator of the noise scale.
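A sketch of the whole measurement loop on synthetic gradients (all numbers hypothetical; in real data-parallel training $|G_{B_{small}}|^{2}$ would be the per-device gradient norm and $|G_{B_{big}}|^{2}$ the norm after all-reduce):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic setup: true gradient with |G|² = 0.04 and tr(Σ) = 4
# (unit per-example variance), so the true noise scale is 4 / 0.04 = 100.
true_G = np.full(4, 0.1)

def grad_norm_sq(B):
    """|G_est|² for one synthetic batch of size B."""
    g = true_G + rng.normal(0.0, 1.0, size=4) / np.sqrt(B)
    return g @ g

B_small, B_big = 32, 1024
ema_G2 = ema_S = 0.0
beta = 0.99  # EMA decay
for step in range(2000):
    g_small, g_big = grad_norm_sq(B_small), grad_norm_sq(B_big)
    # Unbiased single-step estimates of |G|² and tr(Σ):
    G2 = (B_big * g_big - B_small * g_small) / (B_big - B_small)
    S = (g_small - g_big) / (1 / B_small - 1 / B_big)
    # EMA each quantity separately; only the ratio of the EMAs is used,
    # since the ratio of single-step estimates would be badly biased.
    ema_G2 = beta * ema_G2 + (1 - beta) * G2
    ema_S = beta * ema_S + (1 - beta) * S

B_simple_est = ema_S / ema_G2
print(f"estimated noise scale ≈ {B_simple_est:.0f} (true = 100)")
```

The single-step estimates $G2$ and $S$ are individually very noisy (their variance is dominated by the small-batch term), which is exactly why the EMA-then-ratio ordering matters.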