Algorithm

  • $\beta_1, \beta_2 \in [0, 1)$ are the exponential decay rates for the moment estimates

  • $\epsilon$ is for stability reasons (it avoids division by zero when $\sqrt{\hat{v}_t}$ is close to zero)

  • Given the gradient $g_t$ at timestep $t$, Adam

    • maintains a (biased) first moment of the gradient: $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$

    • maintains a (biased) second raw moment of the gradient (elementwise square): $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

    • debiases the first and second moments before updating the params: $\hat{m}_t = m_t / (1 - \beta_1^t)$, $\hat{v}_t = v_t / (1 - \beta_2^t)$

    • updates the parameters by $\theta_t = \theta_{t-1} - \alpha \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$, where $\alpha$ is the stepsize (see the sketch after this list)
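
A minimal NumPy sketch of the loop above; the toy quadratic objective and the default hyperparameters ($\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) are just illustrative choices.

```python
import numpy as np

def adam(grad_fn, theta, steps=1000, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = np.zeros_like(theta)  # biased first-moment estimate
    v = np.zeros_like(theta)  # biased second raw-moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(theta)                    # gradient at timestep t
        m = beta1 * m + (1 - beta1) * g       # update first moment
        v = beta2 * v + (1 - beta2) * g**2    # update second raw moment (elementwise square)
        m_hat = m / (1 - beta1**t)            # debias first moment
        v_hat = v / (1 - beta2**t)            # debias second moment
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # parameter update
    return theta

# Toy objective (assumed for illustration): minimize f(theta) = ||theta||^2, gradient 2*theta
theta = adam(lambda th: 2.0 * th, np.array([1.0, -3.0]), steps=5000, alpha=0.01)
print(theta)  # close to [0, 0]
```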

Intuition

  • $\hat{m}_t / \sqrt{\hat{v}_t}$ is the signal-to-noise ratio (SNR)
  • Indeed, the effective stepsize taken in parameter space is bounded by the stepsize setting, i.e. $|\Delta_t| \lesssim \alpha$
  • This also means that Adam's updates are scale-invariant: rescaling the gradients by a factor $c$ will scale $\hat{m}_t$ by $c$ and $\sqrt{\hat{v}_t}$ by $c$, which cancel out when computing the parameter update (see the check below)
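
A quick numerical check of the scale-invariance claim, assuming $\epsilon = 0$ so the cancellation is exact (with a small $\epsilon$ it only holds approximately); the random gradient sequence is made up for illustration.

```python
import numpy as np

def adam_update(grads, alpha=1e-3, beta1=0.9, beta2=0.999, eps=0.0):
    """Return the last Adam parameter update after seeing a sequence of scalar gradients."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
    return -alpha * m_hat / (np.sqrt(v_hat) + eps)

grads = np.random.randn(10)
u1 = adam_update(grads)          # original gradients
u2 = adam_update(100.0 * grads)  # gradients rescaled by c = 100
print(np.isclose(u1, u2))        # True: c cancels between m_hat and sqrt(v_hat)
```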

Takeaways

  • If the variance is low, the SNR is high, so we take larger effective step-sizes
  • If the SNR is low, either the mean gradient is getting small or the variance is very high; in both cases, we want to take smaller steps
  • Note that the SNR is maintained per parameter, so we keep an estimate of how smooth (or noisy) the loss landscape has been along each individual parameter (see the example below)
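
A small illustration of the per-parameter SNR: one coordinate receives a steady gradient while the other receives a sign-flipping gradient of the same magnitude (a made-up setup, purely for illustration).

```python
import numpy as np

beta1, beta2 = 0.9, 0.999
rng = np.random.default_rng(0)

m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 201):
    steady = 1.0                     # consistent gradient -> low variance, high SNR
    noisy = rng.choice([-1.0, 1.0])  # sign keeps flipping -> high variance, low SNR
    g = np.array([steady, noisy])
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)

snr = np.abs(m_hat) / np.sqrt(v_hat)
print(snr)  # ~1 for the steady coordinate, near 0 for the noisy one -> smaller effective steps
```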