- Simo Ryu guide to scaling from a small-scale proxy: https://cloneofsimo.notion.site/What-to-do-to-scale-up-09e469d7c3444d6a90305397c38a46f5
- Scaling Book - A Systems View of LLMs on TPUs (very good read)
  - Initial takeaway: many of the tricks target training stability and are particularly suited to low-precision settings, e.g., logit soft-capping and sandwich layer normalization (see the sketch below). Does this hint that int8 training is becoming crucial?
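
A minimal JAX sketch of the two stability tricks named above, for my own reference; it is not taken from either linked source. The cap value, the use of unscaled RMSNorm, and the residual placement are assumptions following the common (Gemma-2-style) convention, not a specific recipe from the Scaling Book.

```python
import jax
import jax.numpy as jnp


def soft_cap(logits, cap=30.0):
    """Logit soft-capping: smoothly squash logits into (-cap, cap).

    The cap value (30.0) is an assumed example, not a prescribed setting."""
    return cap * jnp.tanh(logits / cap)


def rms_norm(x, eps=1e-6):
    """Plain RMSNorm without a learned scale, kept minimal for the sketch."""
    return x * jax.lax.rsqrt(jnp.mean(x * x, axis=-1, keepdims=True) + eps)


def sandwich_block(x, sublayer):
    """Sandwich layer normalization: normalize both the sublayer input and
    its output before adding the residual, which bounds activation growth
    and helps keep values in a low-precision-friendly range."""
    return x + rms_norm(sublayer(rms_norm(x)))


# Toy usage: the "sublayer" is just a linear map here (hypothetical example).
key = jax.random.PRNGKey(0)
w = jax.random.normal(key, (16, 16)) * 0.02
x = jax.random.normal(key, (4, 16))
hidden = sandwich_block(x, lambda h: h @ w)
logits = soft_cap(hidden @ w.T)
```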