Algorithm
- $\beta_1, \beta_2$ are the exponential decay rates for the moment estimates
- $\epsilon$ is for stability reasons
- Given the gradient $g_t$ at timestep $t$, Adam:
  - maintains a (biased) first moment of the gradient: $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$
  - maintains a (biased) second raw moment of the gradient (elementwise square): $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$
  - debiases the first and second moment before updating the params: $\hat{m}_t = m_t / (1 - \beta_1^t)$, $\hat{v}_t = v_t / (1 - \beta_2^t)$
  - updates the parameters by $\theta_t = \theta_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ (see the sketch below)
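A minimal NumPy sketch of the update above (the helper name `adam_step`, the default hyperparameters, and the toy usage are illustrative assumptions, not from the original notes):

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta given gradient g at timestep t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g        # biased first moment estimate
    v = beta2 * v + (1 - beta2) * g**2     # biased second raw moment estimate (elementwise square)
    m_hat = m / (1 - beta1**t)             # bias-corrected first moment
    v_hat = v / (1 - beta2**t)             # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = (theta - 1.5)^2, whose gradient is 2 * (theta - 1.5)
theta, m, v = np.array([0.0]), np.zeros(1), np.zeros(1)
for t in range(1, 201):
    g = 2 * (theta - 1.5)
    theta, m, v = adam_step(theta, g, m, v, t, alpha=0.1)
print(theta)  # approaches 1.5
```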
Intuition
- $\hat{m}_t / \sqrt{\hat{v}_t}$ is the signal-to-noise ratio (SNR)
- Indeed, the effective stepsize taken in parameter space is bounded by the stepsize $\alpha$ (approximately, $|\Delta_t| \lesssim \alpha$)
- This also means that Adam is scale-invariant: rescaling the gradients by a factor $c$ will scale $\hat{m}_t$ by $c$ and $\hat{v}_t$ by $c^2$, which cancel out when computing the parameter update (see the worked step below).
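To make the scale-invariance explicit (a short worked step using the notation above, ignoring the $\epsilon$ term): rescaling the gradients $g_t \to c\, g_t$ with $c > 0$ gives

$$
\frac{c\, \hat{m}_t}{\sqrt{c^2\, \hat{v}_t}} = \frac{c\, \hat{m}_t}{c\, \sqrt{\hat{v}_t}} = \frac{\hat{m}_t}{\sqrt{\hat{v}_t}},
$$

so the parameter update is unchanged.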
Takeaways
- If variance is low ⇒ SNR is high ⇒ we take larger effective step-sizes
- If the SNR is low ⇒ the mean gradient is small or the variance is very high ⇒ in both cases, we want to take smaller steps
- Note that the SNR is maintained per parameter, so we keep a per-parameter estimate of how smooth the loss landscape has been (see the sketch below)
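A small self-contained sketch of this per-parameter behaviour (the gradient distributions are made up for illustration and are not from the original notes): two parameters receive gradients with the same mean but different noise, and the noisier one ends up taking smaller effective steps.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
alpha, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = np.array([1.0 + 0.1 * rng.standard_normal(),   # low-variance gradient (high SNR)
                  1.0 + 5.0 * rng.standard_normal()])   # high-variance gradient (low SNR)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat, v_hat = m / (1 - beta1**t), v / (1 - beta2**t)
    theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)

print(theta)  # the first (low-noise) parameter has moved noticeably further
```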