Summary

muP is “just” the idea that the singular values of weight matrices and updates should scale like sqrt(fan-out / fan-in). A matrix with singular values of this size preserves the RMS-norm of aligned inputs.
- Gaussian init with standard deviation 1/sqrt(fan-in) is “wrong” because it makes the singular values the wrong size in highly rectangular matrices, so the matrix does not preserve the RMS-norm of aligned inputs.
- The easiest way to initialize is just to sample a random orthogonal matrix (all singular values are one) and then multiply by sqrt(fan-out / fan-in).
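
The two bullets above can be checked numerically. What follows is a minimal NumPy sketch, not a reference implementation: the function names (`mup_orthogonal_init`, `gaussian_init`, `rms`), the layer sizes, and the seeds are all assumptions made for illustration. It builds a wide matrix both ways, compares the singular values against sqrt(fan-out / fan-in), and applies the orthogonal init to an input lying in its row space (taken here as the meaning of "aligned").

```python
import numpy as np

def mup_orthogonal_init(fan_out, fan_in, seed=0):
    """Random (semi-)orthogonal matrix scaled by sqrt(fan_out / fan_in)."""
    rng = np.random.default_rng(seed)
    # Put the long side on the rows so that QR returns orthonormal columns.
    rows, cols = max(fan_out, fan_in), min(fan_out, fan_in)
    q, r = np.linalg.qr(rng.standard_normal((rows, cols)))
    # Flip column signs using diag(r); a common trick to avoid sign bias from QR.
    q = q * np.sign(np.diag(r))
    if fan_out < fan_in:
        q = q.T  # wide case: orthonormal rows instead of columns
    return q * np.sqrt(fan_out / fan_in)

def gaussian_init(fan_out, fan_in, seed=0):
    """Gaussian init with standard deviation 1 / sqrt(fan_in)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((fan_out, fan_in)) / np.sqrt(fan_in)

def rms(v):
    return np.sqrt(np.mean(v ** 2))

# Illustrative sizes: a highly rectangular (wide) layer.
fan_out, fan_in = 64, 1024
target = np.sqrt(fan_out / fan_in)

w_orth = mup_orthogonal_init(fan_out, fan_in)
w_gauss = gaussian_init(fan_out, fan_in)

sv_orth = np.linalg.svd(w_orth, compute_uv=False)
sv_gauss = np.linalg.svd(w_gauss, compute_uv=False)

# An "aligned" input: a vector in the row space of w_orth, normalized to RMS 1.
x = w_orth.T @ np.random.default_rng(1).standard_normal(fan_out)
x = x / rms(x)
y = w_orth @ x

print("target singular value:", target)
print("orthogonal init, min/max:", sv_orth.min(), sv_orth.max())
print("gaussian init, min/max:", sv_gauss.min(), sv_gauss.max())
print("RMS in / RMS out (orthogonal init):", rms(x), rms(y))
```

With these sizes, every singular value of the orthogonal init equals the target by construction, while those of the Gaussian init sit well above it. Swapping `fan_out` and `fan_in` gives the tall case, where the Gaussian init lands much closer to the target.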

🤖 Harold's Notes