Basic formulation
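The core update, as a rough Python/PyTorch sketch: momentum accumulation, then an approximate orthogonalization of the update matrix via a few Newton-Schulz iterations instead of an exact SVD. The momentum-then-orthogonalize structure is standard Muon; the exact coefficients, normalization, and learning-rate scaling below are the commonly circulated ones and vary between implementations, so read this as an illustration rather than a reference implementation.

```python
import torch

def newton_schulz_orthogonalize(M: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map M = U S V^T to U V^T with an odd-polynomial iteration.

    Coefficients are the widely used quintic Newton-Schulz variant; treat
    them as an assumption, not the only valid choice.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (M.norm() + 1e-7)            # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # iterate on the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

def muon_step(W: torch.Tensor, G: torch.Tensor, buf: torch.Tensor,
              lr: float = 0.02, momentum: float = 0.95):
    """One Muon update for a single 2D weight matrix (sketch)."""
    buf.mul_(momentum).add_(G)               # momentum buffer
    O = newton_schulz_orthogonalize(buf)     # ~orthogonal update direction
    W.add_(O, alpha=-lr)                     # scaling conventions vary by impl
    return W, buf
```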
Extensions
- MuonBP: Faster Muon via Block-Periodic Orthogonalization
  - Using Muon on large sharded models adds communication overhead from the gather/scatter operations on sharded matrices. Turns out you can avoid most of this by running the full (gathered) orthogonalization only periodically rather than skipping it entirely, and doing Muon on the local shard blocks the rest of the time; see the sketch after this list.
- https://www.essential.ai/blog/infra-layer-sharding-for-large-scale-training-with-muon
- Various approaches to parallelizing Muon (mainhorse blog)
- Practical Efficiency of Muon for Pretraining
  - Essential AI's infra tips for Muon
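A minimal sketch of the block-periodic idea from the MuonBP bullet above, reusing `newton_schulz_orthogonalize` from the basic-formulation sketch. In a real sharded run each row block is a rank-local shard, and the periodic full update is the step that pays for the all-gather/scatter; the function name, `num_blocks`, and `full_period` here are illustrative choices, not taken from the paper.

```python
import torch

def muonbp_orthogonalize(M: torch.Tensor, step: int,
                         num_blocks: int = 4, full_period: int = 8) -> torch.Tensor:
    """Block-periodic orthogonalization of a momentum matrix M (sketch).

    Most steps: orthogonalize each row block independently -- in a sharded
    setup these blocks already live on separate ranks, so no communication
    is needed. Every `full_period` steps: orthogonalize the full matrix,
    which in a real run is where the gather/scatter cost lands.
    """
    if step % full_period == 0:
        return newton_schulz_orthogonalize(M)           # periodic full update
    blocks = torch.chunk(M, num_blocks, dim=0)          # local, per-shard blocks
    return torch.cat([newton_schulz_orthogonalize(b) for b in blocks], dim=0)
```

The rest of the Muon step (momentum, learning rate, scaling) stays the same; only the orthogonalization schedule changes.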