Practical Efficiency of Muon for Pretraining

We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study the combination of Muon and the maximal update parameterization (muP) for efficient hyperparameter transfer and present a simple telescoping algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources. We validate our findings through extensive experiments with model sizes up to four billion parameters and ablations on the data distribution and architecture.
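
For readers unfamiliar with the optimizer, the following minimal sketch illustrates the Muon update as described in its public reference implementation: the momentum-accumulated gradient of each weight matrix is approximately orthogonalized with a few Newton-Schulz iterations before being applied. The coefficients, iteration count, and function names below are illustrative assumptions drawn from that public description, not the configuration used in this paper.

    import torch

    def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
        # Approximately orthogonalize a 2D update via a quintic Newton-Schulz iteration.
        # Coefficients follow the public Muon reference implementation (assumption).
        a, b, c = 3.4445, -4.7750, 2.0315
        x = g / (g.norm() + 1e-7)            # normalize so the iteration converges
        transposed = x.size(0) > x.size(1)
        if transposed:                        # work with the smaller Gram matrix
            x = x.T
        for _ in range(steps):
            A = x @ x.T
            x = a * x + (b * A + c * A @ A) @ x
        return x.T if transposed else x

    def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
        # One illustrative Muon update for a single weight matrix (sketch only;
        # details such as Nesterov momentum and shape-dependent scaling are omitted).
        momentum_buf.mul_(beta).add_(grad)               # heavy-ball momentum
        update = newton_schulz_orthogonalize(momentum_buf)
        weight.add_(update, alpha=-lr)                   # apply orthogonalized update

In this sketch, Muon applies to 2D weight matrices only; in practice, embeddings, output heads, and other non-matrix parameters are typically optimized with AdamW alongside it.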