Practical Efficiency of Muon for Pretraining

We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study the combination of Muon and the maximal update parameterization (muP) for efficient hyperparameter transfer and present a simple telescoping algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources. We validate our findings through extensive experiments with model sizes up to four billion parameters and ablations on the data distribution and architecture.
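
For readers unfamiliar with the optimizer, the following minimal sketch illustrates the Muon update as described in its public reference implementation: the momentum-accumulated gradient of each weight matrix is approximately orthogonalized with a few Newton-Schulz iterations before being applied. The coefficients, iteration count, and function names below are illustrative assumptions drawn from that public description, not the configuration used in this paper.

    import torch

    def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
        # Approximately orthogonalize a 2D update via a quintic Newton-Schulz iteration.
        # Coefficients follow the public Muon reference implementation (assumption).
        a, b, c = 3.4445, -4.7750, 2.0315
        x = g / (g.norm() + 1e-7)            # normalize so the iteration converges
        transposed = x.size(0) > x.size(1)
        if transposed:                        # work with the smaller Gram matrix
            x = x.T
        for _ in range(steps):
            A = x @ x.T
            x = a * x + (b * A + c * A @ A) @ x
        return x.T if transposed else x

    def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
        # One illustrative Muon update for a single weight matrix (sketch only;
        # details such as Nesterov momentum and shape-dependent scaling are omitted).
        momentum_buf.mul_(beta).add_(grad)               # heavy-ball momentum
        update = newton_schulz_orthogonalize(momentum_buf)
        weight.add_(update, alpha=-lr)                   # apply orthogonalized update

In this sketch, Muon applies to 2D weight matrices only; in practice, embeddings, output heads, and other non-matrix parameters are typically optimized with AdamW alongside it.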