Algorithm
- $\beta_1, \beta_2$ are the exponential decay rates for the moment estimates
- $\epsilon$ is for stability reasons
- Given the gradient $g_t$ at timestep $t$, Adam:
  - maintains a (biased) first moment of the gradient: $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$
  - maintains a (biased) second raw moment of the gradient (elementwise square): $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$
  - debiases the first and second moment before updating the params: $\hat{m}_t = m_t / (1 - \beta_1^t)$, $\hat{v}_t = v_t / (1 - \beta_2^t)$
  - updates the parameters by $\theta_t = \theta_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ (see the sketch below)
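A minimal NumPy sketch of the update above (the helper name `adam_step`, the default hyperparameters, and the toy usage are illustrative assumptions, not from the original notes):

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta given gradient g at timestep t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g        # biased first moment estimate
    v = beta2 * v + (1 - beta2) * g**2     # biased second raw moment estimate (elementwise square)
    m_hat = m / (1 - beta1**t)             # bias-corrected first moment
    v_hat = v / (1 - beta2**t)             # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = (theta - 1.5)^2, whose gradient is 2 * (theta - 1.5)
theta, m, v = np.array([0.0]), np.zeros(1), np.zeros(1)
for t in range(1, 201):
    g = 2 * (theta - 1.5)
    theta, m, v = adam_step(theta, g, m, v, t, alpha=0.1)
print(theta)  # approaches 1.5
```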
Intuition
- $\hat{m}_t / \sqrt{\hat{v}_t}$ is the signal-to-noise ratio (SNR)
- Indeed, the effective stepsize taken in parameter space is bounded by the stepsize $\alpha$ (approximately, $|\Delta_t| \lesssim \alpha$)
- This also means that Adam is scale-invariant: rescaling the gradients by a factor $c$ will scale $\hat{m}_t$ by $c$ and $\hat{v}_t$ by $c^2$, which cancel out when computing the parameter update (see the worked step below).
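To make the scale-invariance explicit (a short worked step using the notation above, ignoring the $\epsilon$ term): rescaling the gradients $g_t \to c\, g_t$ with $c > 0$ gives

$$
\frac{c\, \hat{m}_t}{\sqrt{c^2\, \hat{v}_t}} = \frac{c\, \hat{m}_t}{c\, \sqrt{\hat{v}_t}} = \frac{\hat{m}_t}{\sqrt{\hat{v}_t}},
$$

so the parameter update is unchanged.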
Takeaways
- If variance is low ⇒ SNR is high ⇒ we take larger effective step-sizes
- If the SNR is low ⇒ the mean gradient is small or the variance is very high ⇒ in both cases, we want to take smaller steps
- Note that the SNR is maintained per parameter, so we keep a per-parameter estimate of how smooth the loss landscape has been (see the sketch below)
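A small self-contained sketch of this per-parameter behaviour (the gradient distributions are made up for illustration and are not from the original notes): two parameters receive gradients with the same mean but different noise, and the noisier one ends up taking smaller effective steps.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
alpha, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = np.array([1.0 + 0.1 * rng.standard_normal(),   # low-variance gradient (high SNR)
                  1.0 + 5.0 * rng.standard_normal()])   # high-variance gradient (low SNR)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat, v_hat = m / (1 - beta1**t), v / (1 - beta2**t)
    theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)

print(theta)  # the first (low-noise) parameter has moved noticeably further
```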