- Experiments on tweaks to the transformer architecture for text-to-image: https://wandb.ai/dalle-mini/dalle-mini/reports/An-Evaluation-of-Transformer-Variants--VmlldzoxNjk4MTIw
- Very nice summary of model architectures by Songlin Yang: https://sustcsonglin.github.io/assets/pdf/talk_250117.pdf
- Transformer improvement thread:
  - Reformer: cuts activation memory from O(n) in depth to O(1) via reversible residual layers, with the same convergence (see the sketch after this list). Paper: https://arxiv.org/abs/2001.04451
  - MLP-Mixer: outscales transformers in the high-data regime while using a fraction of the parameters and runtime (2x with full-context L3.1-405B; 9x with 8192 tokens). Paper: https://arxiv.org/abs/2105.08050
  - WideNet: MoE uses too much memory, so why not share the experts across layers? WideNet reduces memory and gave me 2x end-to-end training speedups (sketch below). Paper: https://arxiv.org/abs/2107.11817
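A minimal sketch (not from the thread) of the reversible residual coupling behind the Reformer memory claim: because a block's inputs can be recomputed exactly from its outputs, activations do not need to be cached per layer for backprop, so activation memory stays constant in depth. `f` and `g` are hypothetical toy stand-ins for the attention and feed-forward sublayers.

```python
import numpy as np

def rev_forward(x1, x2, f, g):
    """Reversible coupling: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def rev_inverse(y1, y2, f, g):
    """Recover the block inputs exactly from its outputs.

    During the backward pass, activations can be recomputed layer by layer
    from the final output instead of being stored for every layer."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

# Toy sublayers and inputs (hypothetical shapes, for illustration only).
rng = np.random.default_rng(0)
W_f = rng.normal(scale=0.1, size=(16, 16))
W_g = rng.normal(scale=0.1, size=(16, 16))
f = lambda x: np.tanh(x @ W_f)
g = lambda x: np.tanh(x @ W_g)

x1, x2 = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
y1, y2 = rev_forward(x1, x2, f, g)
r1, r2 = rev_inverse(y1, y2, f, g)
assert np.allclose(x1, r1) and np.allclose(x2, r2)  # round trip is exact
```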
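And a minimal sketch (again, not from the paper or thread) of the cross-layer expert sharing WideNet describes: one expert bank is allocated once and reused at every layer, so expert parameter memory no longer grows with depth. The shapes, the top-1 router, and exactly which parts stay layer-specific are assumptions for illustration; the paper keeps some per-layer components (e.g. layer normalization) independent.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, n_layers = 64, 256, 4, 12

# One shared bank of expert MLPs, reused at every layer (WideNet-style),
# instead of n_layers separate banks.
shared_experts = [
    (rng.normal(scale=0.02, size=(d_model, d_ff)),
     rng.normal(scale=0.02, size=(d_ff, d_model)))
    for _ in range(n_experts)
]

def expert_forward(x, w_in, w_out):
    return np.maximum(x @ w_in, 0.0) @ w_out  # simple ReLU MLP expert

def moe_layer(x, router_w):
    # Per-token top-1 routing; the router here is layer-specific, so layers
    # can still use the shared experts differently.
    choice = (x @ router_w).argmax(axis=-1)    # (tokens,)
    out = np.zeros_like(x)
    for e, (w_in, w_out) in enumerate(shared_experts):
        mask = choice == e
        if mask.any():
            out[mask] = expert_forward(x[mask], w_in, w_out)
    return x + out                             # residual connection

routers = [rng.normal(scale=0.02, size=(d_model, n_experts)) for _ in range(n_layers)]
x = rng.normal(size=(10, d_model))
for router_w in routers:                       # same experts at every depth
    x = moe_layer(x, router_w)
```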
- Mechanistic Design and Scaling of Hybrid Architectures
- Scaling Laws for Linear Complexity Language Models
- Compute Better Spent: Replacing Dense Layers with Structured Matrices
- An Empirical Study of Mamba-based Language Models
- Scaling laws with structured layers: https://arxiv.org/pdf/2410.02117