Training
- Picotron (nanoGPT for 4D parallelism)
- Megatron blogpost (scatter-gather optimization, performance microbenchmarks for pipeline parallelism)
- GSPMD: General and Scalable Parallelization for ML Computation Graphs
  - GSPMD is now a foundational component of JAX/TensorFlow distributed training; it lets the XLA compiler partition a computation from per-array sharding annotations, so users can train models efficiently at large scale (see the sketch after this list).
- Universal Checkpointing https://x.com/stasbekman/status/1808287880781127930?s=12
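A minimal sketch of how the GSPMD-style workflow looks through JAX's sharding API: annotate inputs with shardings over a device mesh and let the compiler propagate them through a jitted function. The mesh axis name "data", the toy shapes, and the `forward` function are illustrative assumptions, not taken from the paper.

```python
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# 1D mesh over all available devices; the axis name "data" is arbitrary.
mesh = Mesh(jax.devices(), axis_names=("data",))

# Shard the batch dimension of the activations across the mesh
# (batch size must be divisible by the device count);
# replicate the weight matrix on every device.
x = jax.device_put(jnp.ones((8, 512)), NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((512, 256)), NamedSharding(mesh, P(None, None)))

@jax.jit
def forward(x, w):
    # The compiler propagates the input shardings through the graph and
    # inserts any needed collectives automatically; no manual communication code.
    return jnp.tanh(x @ w)

y = forward(x, w)
print(y.shape, y.sharding)  # (8, 256), sharded along the leading ("data") axis
```

The same mechanism extends to model parallelism by adding mesh axes (e.g. a "model" axis) and sharding weight dimensions over them.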