- NeurIPS - Tutorial on Language Modeling
Physics of LLMs https://arxiv.org/abs/2404.05405.
MiniCPM: Unveiling the Potential of End-side Large Language Models
Tulu 3: Pushing Frontiers in Open Language Model Post-Training
FineWeb-Edu creation
Very nice summary of model architectures by Songlin Yang: https://sustcsonglin.github.io/assets/pdf/talk_250117.pdf
Scaling Book - A Systems View of LLMs on TPUs (very good read)
DeepSeek-V3
- codebase
- DeepSeek-V3 Technical Report
- The first 3 layers are dense.
- https://x.com/armenagha/status/1872426813865201700 Essentially: “If you don’t do this, you see router collapse (router always picked single expert in TC) in the earlier layers if you didn’t very carefully balance experts/upcycle or use EC. You can see a similar phenomena in the very last layers as well.”
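- A minimal sketch of the layer layout described above: the first few layers use a dense FFN and the rest use MoE, which is what avoids the early-layer router collapse the tweet mentions. The function name `ffn_type` and the helper itself are illustrative, not from the DeepSeek-V3 codebase; the layer counts (61 layers, 3 dense) follow the technical report.

```python
# Illustrative sketch (not DeepSeek-V3 code): per-layer choice of FFN variant.
# The first NUM_DENSE_LAYERS transformer layers use a plain dense FFN; the
# remaining layers use MoE. Layer counts follow the DeepSeek-V3 report.

NUM_LAYERS = 61        # total transformer layers in DeepSeek-V3
NUM_DENSE_LAYERS = 3   # first 3 layers are dense, not MoE

def ffn_type(layer_idx: int) -> str:
    """Return which feed-forward variant a given layer index uses."""
    return "dense" if layer_idx < NUM_DENSE_LAYERS else "moe"

layer_types = [ffn_type(i) for i in range(NUM_LAYERS)]
print(layer_types[:5])  # first three layers dense, then MoE
```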