- Better & Faster Large Language Models via Multi-token Prediction
- Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective
Optimizers
- The General Theory of Modular Duality by Jeremy Bernstein (Twitter thread on the paper)
  - converges faster and uses significantly less memory than SOAP
- Shampoo: https://proceedings.mlr.press/v80/gupta18a/gupta18a.pdf
- Torch optimizer code: https://github.com/ClashLuke/heavyball?tab=readme-ov-file#foreachsoap
- Distributed PyTorch implementation of SOAP/eigenvalue-corrected Shampoo with support for low-precision data types: https://x.com/runame_/status/1854242159483867518
- One might think Shampoo is some weird, radical-ass new optimizer that promises wild performance. Let me tell you something reassuring: Shampoo (with blocks) is a generalization of Adam. That is, Shampoo with specific hyperparameters *is* Adam, so a tuning space that contains Adam as a special case makes it logically impossible for well-tuned Shampoo to perform worse than Adam.
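
A quick way to see the reduction (my own sketch, not code from the Shampoo paper; function names are hypothetical): restrict blocked Shampoo to 1×1 blocks with EMA statistics. For a scalar block the left and right Kronecker factors coincide (L == R == EMA of g²), so the preconditioner L^{-1/4} g R^{-1/4} collapses to g / √L, which is exactly Adam's second-moment scaling; add Adam's momentum and bias correction and the updates match step for step.

```python
import numpy as np

def adam_update(g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step: EMAs of the gradient and squared gradient, bias-corrected."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return -lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def shampoo_1x1_update(g, m, L, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Blocked Shampoo restricted to 1x1 blocks, with EMA statistics.

    For a scalar block the left and right factors coincide (L == R ==
    EMA of g^2), so L^{-1/4} g R^{-1/4} reduces to g / sqrt(L): Adam's
    second-moment scaling.
    """
    m = b1 * m + (1 - b1) * g          # same first moment as Adam
    L = b2 * L + (1 - b2) * g * g      # scalar-block statistic, equals Adam's v
    m_hat = m / (1 - b1 ** t)
    L_hat = L / (1 - b2 ** t)
    # L_hat**-0.25 * m_hat * L_hat**-0.25 == m_hat / sqrt(L_hat); eps as in Adam.
    return -lr * m_hat / (np.sqrt(L_hat) + eps), m, L

# Check: the two updates agree at every step.
rng = np.random.default_rng(0)
m_a = v = m_s = L = np.zeros(5)
for t in range(1, 6):
    g = rng.normal(size=5)
    step_adam, m_a, v = adam_update(g, m_a, v, t)
    step_shmp, m_s, L = shampoo_1x1_update(g, m_s, L, t)
    assert np.allclose(step_adam, step_shmp)
```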