- Better & Faster Large Language Models via Multi-token Prediction
- Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective
Optimizers
- The General Theory of Modular Duality by Jeremy Bernstein (Twitter thread on the paper)
  - converges faster and uses significantly less memory than SOAP
- Shampoo: https://proceedings.mlr.press/v80/gupta18a/gupta18a.pdf
- Torch optimizer code: https://github.com/ClashLuke/heavyball?tab=readme-ov-file#foreachsoap
- Distributed PyTorch implementation of SOAP/eigenvalue-corrected Shampoo with support for low-precision data types: https://x.com/runame_/status/1854242159483867518
- One might think Shampoo is some weird, radical-ass new optimizer that promises wild performance. Let me tell you something reassuring: Shampoo (with blocks) is a generalization of Adam. That is, Shampoo with specific hyperparameters *is* Adam, so a tuning space that contains Adam as a special case makes it logically impossible for well-tuned Shampoo to perform worse than Adam.
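
A quick way to see the reduction (my own sketch, not code from the Shampoo paper; function names are hypothetical): restrict blocked Shampoo to 1×1 blocks with EMA statistics. For a scalar block the left and right Kronecker factors coincide (L == R == EMA of g²), so the preconditioner L^{-1/4} g R^{-1/4} collapses to g / √L, which is exactly Adam's second-moment scaling; add Adam's momentum and bias correction and the updates match step for step.

```python
import numpy as np

def adam_update(g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step: EMAs of the gradient and squared gradient, bias-corrected."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return -lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def shampoo_1x1_update(g, m, L, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Blocked Shampoo restricted to 1x1 blocks, with EMA statistics.

    For a scalar block the left and right factors coincide (L == R ==
    EMA of g^2), so L^{-1/4} g R^{-1/4} reduces to g / sqrt(L): Adam's
    second-moment scaling.
    """
    m = b1 * m + (1 - b1) * g          # same first moment as Adam
    L = b2 * L + (1 - b2) * g * g      # scalar-block statistic, equals Adam's v
    m_hat = m / (1 - b1 ** t)
    L_hat = L / (1 - b2 ** t)
    # L_hat**-0.25 * m_hat * L_hat**-0.25 == m_hat / sqrt(L_hat); eps as in Adam.
    return -lr * m_hat / (np.sqrt(L_hat) + eps), m, L

# Check: the two updates agree at every step.
rng = np.random.default_rng(0)
m_a = v = m_s = L = np.zeros(5)
for t in range(1, 6):
    g = rng.normal(size=5)
    step_adam, m_a, v = adam_update(g, m_a, v, t)
    step_shmp, m_s, L = shampoo_1x1_update(g, m_s, L, t)
    assert np.allclose(step_adam, step_shmp)
```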