
Taming LLMs by Scaling Learning Rates with Gradient Grouping

Siyuan Li, Juanxi Tian, Zedong Wang, Xin Jin, Zicheng Liu, Wentao Zhang, Dan Xu
Publication date: 6/3/2025
Abstract

Training large language models (LLMs) poses challenges due to their massive scale and heterogeneous architectures. While adaptive optimizers like AdamW help address gradient variations, they still struggle with efficient and effective parameter-wise learning rate estimation, resulting in training instability, slow convergence, and poor compatibility with parameter-efficient fine-tuning (PEFT) techniques. This work introduces Scaling with Gradient Grouping (SGG), an optimizer wrapper that improves adaptive learning rate estimation by dynamic grouping and group-specific scaling. SGG first groups gradient statistics in each layer into clusters and then applies cluster-specific scaling to calibrate learning rates for each parameter, thus imposing collective group-wise constraints while maintaining precise per-parameter adaptation. Experiments on diverse (M)LLM benchmarks show that SGG integrates seamlessly with existing optimizers and offers consistent gains and faster convergence over baselines across various model sizes. Its stability across varying batch sizes and learning rates establishes SGG as a robust choice for LLM optimization.
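
To make the grouping-and-scaling idea concrete, the sketch below is a minimal, hypothetical PyTorch rendering of an optimizer wrapper in the spirit of the abstract: within each layer it clusters a per-parameter gradient statistic and applies a cluster-specific rescaling before the base optimizer's update. The cluster count, the quantile-based grouping, the median-ratio scaling rule, and the choice of rescaling gradients as a proxy for calibrating per-parameter learning rates are all illustrative assumptions, not the paper's exact SGG algorithm.

```python
# Hypothetical sketch only: a simplified "group-then-scale" optimizer wrapper.
# Assumptions (not from the paper): gradient magnitude as the grouped statistic,
# quantile-based clustering, and a median-ratio scaling rule.
import torch
from torch import nn


class GroupedLRScaling:
    """Wraps a base optimizer; rescales each layer's gradients by cluster."""

    def __init__(self, base_optimizer, num_clusters=3):
        self.base = base_optimizer
        self.num_clusters = num_clusters

    @torch.no_grad()
    def step(self):
        for group in self.base.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                stat = p.grad.abs().flatten()  # per-parameter gradient statistic
                # Quantile-based grouping within the layer (a stand-in for the
                # paper's dynamic clustering).
                edges = torch.quantile(
                    stat,
                    torch.linspace(0, 1, self.num_clusters + 1, device=stat.device),
                )
                cluster_id = torch.bucketize(stat, edges[1:-1])
                layer_median = stat.median().clamp_min(1e-12)
                scale = torch.ones_like(stat)
                for c in range(self.num_clusters):
                    mask = cluster_id == c
                    if mask.any():
                        # Cluster-specific scaling toward the layer-level statistic,
                        # imposing a collective group-wise constraint.
                        cluster_median = stat[mask].median().clamp_min(1e-12)
                        scale[mask] = (layer_median / cluster_median).sqrt()
                # Rescale the gradient as a proxy for per-parameter LR calibration.
                p.grad.mul_(scale.view_as(p.grad))
        self.base.step()

    def zero_grad(self, set_to_none=True):
        self.base.zero_grad(set_to_none=set_to_none)


# Usage: wrap AdamW with grouped scaling on a toy model.
model = nn.Linear(16, 4)
opt = GroupedLRScaling(torch.optim.AdamW(model.parameters(), lr=1e-3))
loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()
opt.step()
opt.zero_grad()
```

Because the wrapper only post-processes gradients and then delegates to the wrapped optimizer, it slots in front of AdamW or other adaptive methods without changing their update rules, which mirrors the abstract's claim that SGG integrates with existing optimizers rather than replacing them.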