Model Merging in Pre-training of Large Language Models

Model merging has emerged as a promising technique for enhancing large language models, though its application in large-scale pre-training remains relatively unexplored. In this paper, we present a comprehensive investigation of model merging techniques during the pre-training process. Through extensive experiments with both dense and Mixture-of-Experts (MoE) architectures ranging from millions to over 100 billion parameters, we demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs. Our detailed ablation studies on merging strategies and hyperparameters provide new insights into the underlying mechanisms while uncovering novel applications. Through comprehensive experimental analysis, we offer the open-source community practical pre-training guidelines for effective model merging.
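
To make the kind of checkpoint merging described above concrete, the sketch below averages the parameters of several pre-training checkpoints into a single model. It is a minimal illustration, not the paper's exact recipe: the checkpoint file names, the use of PyTorch state_dicts, and the uniform merging weights are all assumptions introduced here for demonstration.

```python
# Minimal sketch of checkpoint merging by parameter averaging.
# Assumptions (not from the paper): checkpoints are PyTorch state_dicts saved
# to disk, all share the same architecture, and weights are uniform by default.
import torch


def merge_checkpoints(paths, weights=None):
    """Return a state_dict that is a weighted average of the given checkpoints."""
    if weights is None:
        weights = [1.0 / len(paths)] * len(paths)  # uniform weighting by default
    merged = None
    for path, w in zip(paths, weights):
        state = torch.load(path, map_location="cpu")
        if merged is None:
            merged = {k: w * v.float() for k, v in state.items()}
        else:
            for k, v in state.items():
                merged[k] += w * v.float()
    return merged


if __name__ == "__main__":
    # Hypothetical checkpoints taken along a constant-learning-rate run.
    merged_state = merge_checkpoints(
        ["ckpt_step_10000.pt", "ckpt_step_20000.pt", "ckpt_step_30000.pt"]
    )
    torch.save(merged_state, "merged.pt")
```

In this simple form, merging reduces to averaging matched parameter tensors across checkpoints; more elaborate schemes would adjust the per-checkpoint weights or the selection of checkpoints, which the paper's ablations explore.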