
Model Merging in Pre-training of Large Language Models

Li, Yunshui ; Ma, Yiyuan ; Yan, Shen ; Zhang, Chaoyi ; Liu, Jing ; Lu, Jianqiao ; Xu, Ziwen ; Chen, Mengzhao ; Wang, Minrui ; Zhan, Shiyi ; Ma, Jin ; Lai, Xunhao ; Luo, Yao ; Bin, Xingyan ; Ren, Hongbin ; Han, Mingji ; Hao, Wenhao ; Yi, Bairen ; Liu, LingJun ; Ma, Bole ; Jia, Xiaoying ; Xun, Zhou ; Xiang, Liang ; Wu, Yonghui
Publication date: 5/21/2025
Abstract

Model merging has emerged as a promising technique for enhancing large language models, though its application in large-scale pre-training remains relatively unexplored. In this paper, we present a comprehensive investigation of model merging techniques during the pre-training process. Through extensive experiments with both dense and Mixture-of-Experts (MoE) architectures ranging from millions to over 100 billion parameters, we demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs. Our detailed ablation studies on merging strategies and hyperparameters provide new insights into the underlying mechanisms while uncovering novel applications. Through comprehensive experimental analysis, we offer the open-source community practical pre-training guidelines for effective model merging.
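
To illustrate the core idea of merging checkpoints from a single pre-training run, here is a minimal sketch in Python that uniformly averages the weights of several saved checkpoints. It assumes PyTorch-style state dicts saved at regular intervals during a constant-learning-rate run; the checkpoint file names and the uniform weighting are illustrative assumptions, and the paper's actual merging strategies and hyperparameters may differ.

```python
# Minimal sketch: merge pre-training checkpoints by uniform weight averaging.
# Assumes PyTorch state dicts; file names below are hypothetical.
import torch

checkpoint_paths = [
    "ckpt_step_10000.pt",  # hypothetical checkpoints from one constant-LR run
    "ckpt_step_20000.pt",
    "ckpt_step_30000.pt",
]

merged = None
for path in checkpoint_paths:
    state = torch.load(path, map_location="cpu")
    if merged is None:
        # Initialize the running sum with float copies of the first checkpoint.
        merged = {k: v.float().clone() for k, v in state.items()}
    else:
        # Accumulate the remaining checkpoints parameter by parameter.
        for k, v in state.items():
            merged[k] += v.float()

# Uniform average over all checkpoints, then save the merged model.
merged = {k: v / len(checkpoint_paths) for k, v in merged.items()}
torch.save(merged, "merged_checkpoint.pt")
```

The merged state dict can then be loaded into the same model architecture for evaluation, for example to compare its performance against the latest individual checkpoint.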