-
Partially explained in TFLOPS, MFU
-
MFU (model FLOPs utilization) = the ratio of the observed throughput (tokens per second) to the theoretical maximum throughput of a system operating at peak FLOPs.
-
Defined in Appendix B of the PaLM paper: https://arxiv.org/pdf/2204.02311
torchtitan
https://github.com/pytorch/torchtitan/blob/b0ed7f075921357b01e28fddc6d90a2cc410bab3/torchtitan/utils.py#L123
https://github.com/pytorch/torchtitan/blob/b0ed7f075921357b01e28fddc6d90a2cc410bab3/train.py#L434
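Rough sketch of the MFU calculation, assuming the PaLM Appendix B estimate for a dense decoder-only Transformer: 6N + 12·L·H·Q·T matmul FLOPs per token (N parameters, L layers, H heads, Q head dimension, T sequence length). Function and argument names here are made up for illustration and are not taken from torchtitan; peak FLOPs has to come from the hardware spec sheet.

```python
def flops_per_token(num_params: int, n_layers: int, n_heads: int,
                    head_dim: int, seq_len: int) -> int:
    """Training matmul FLOPs per token (forward + backward), PaLM-style estimate.

    6 * N covers the dense (parameter) matmuls; 12 * L * H * Q * T covers the
    self-attention matmuls. Assumes a dense decoder-only Transformer.
    """
    return 6 * num_params + 12 * n_layers * n_heads * head_dim * seq_len


def mfu(tokens_per_second: float, flops_per_tok: float,
        peak_flops_per_second: float) -> float:
    """Observed throughput divided by the theoretical maximum throughput."""
    # Theoretical maximum tokens/s if the hardware ran at peak FLOPs.
    max_tokens_per_second = peak_flops_per_second / flops_per_tok
    return tokens_per_second / max_tokens_per_second


if __name__ == "__main__":
    # Illustrative numbers only (hypothetical 7B-parameter model, one accelerator).
    fpt = flops_per_token(num_params=7_000_000_000, n_layers=32,
                          n_heads=32, head_dim=128, seq_len=4096)
    # 312e12 is an assumed per-device peak; substitute the real value.
    print(f"MFU: {mfu(3000.0, fpt, 312e12):.1%}")
```

Use per-device tokens/s with per-device peak FLOPs, or totals for both; mixing the two gives a wrong ratio.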
Mamba layers