NVIDIA Megatron advances optimizers for faster LLM training

NVIDIA has announced comprehensive support for emerging higher-order optimization algorithms, such as Muon and Shampoo, to accelerate large language model training on its latest hardware. Although these methods have been applied to neural networks for about a decade, they have only recently shown significant success in training top-tier open-source models such as Kimi K2 and GLM-5. The new support addresses the practical challenges of running complex optimizers at scale: high memory cost, numerical instability, and communication bottlenecks.

Performance evaluations on the NVIDIA GB300 NVL72 system show that Muon reaches training throughput comparable to the standard AdamW optimizer, with minimal performance loss. When the computational cost of the Newton-Schulz iterations is counted as useful work, Muon even shows higher Model FLOPs Utilization. In other words, Muon matches or slightly exceeds AdamW efficiency depending on the metric used. These results were obtained with NVIDIA NeMo Megatron Bridge 26.02. For example, Kimi K2 was trained on 256 GB300 GPUs with its own parallel configuration, while Qwen3 30B-A3B was trained on eight GPUs.

To enable these optimizers, NVIDIA introduced several key technologies in the Megatron Core library. The primary one is the layer-wise distributed optimizer. Traditional element-wise distribution splits optimizer states across data-parallel ranks without regard to layer boundaries, which is incompatible with optimizers like Muon that need the full gradient of a layer for preconditioning. The layer-wise approach instead assigns entire layers to specific ranks, so each rank can compute the preconditioner for its layers over the complete matrix. Because layers differ in size, this requires variable-size communication, which is fully integrated into the framework.

Tensor parallelism presents its own challenge, because individual weight matrices are sharded across devices. To handle the orthogonalization steps that Muon requires, NVIDIA implemented three modes within the TensorParallelMuon framework.
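For readers unfamiliar with the orthogonalization step, the following is a minimal NumPy sketch of a Newton-Schulz iteration applied to a momentum matrix. It is illustrative only and is not NVIDIA's implementation: the function name `newton_schulz_orthogonalize`, the step count, and the matrix sizes are assumptions, and it uses the classical cubic iteration rather than the tuned polynomial that production Muon implementations typically run for only a few steps.

```python
import numpy as np


def newton_schulz_orthogonalize(momentum, steps=40, eps=1e-7):
    """Approximate the nearest semi-orthogonal matrix to `momentum`.

    Hypothetical helper for illustration: classical cubic Newton-Schulz,
    X <- 1.5 * X - 0.5 * (X X^T) X, which drives every singular value to 1.
    """
    x = np.asarray(momentum, dtype=np.float64)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # work on the wide orientation so x @ x.T is the smaller Gram matrix
    # Dividing by the Frobenius norm bounds all singular values by 1,
    # which keeps the cubic iteration inside its convergence region.
    x = x / (np.linalg.norm(x) + eps)
    for _ in range(steps):
        gram = x @ x.T  # symmetric product, the kind of operation a SYRK kernel computes
        x = 1.5 * x - 0.5 * gram @ x
    return x.T if transposed else x


rng = np.random.default_rng(0)
momentum = rng.standard_normal((4, 6))  # toy stand-in for one layer's momentum
update = newton_schulz_orthogonalize(momentum)
residual = np.abs(update @ update.T - np.eye(4)).max()
print(f"max deviation from orthonormal rows: {residual:.2e}")
```

The iteration needs the whole matrix, which is why the way a layer is split across data-parallel or tensor-parallel ranks matters for Muon.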
The duplicated mode gathers the momentum across all tensor-parallel devices and runs the full Newton-Schulz iterations on every GPU, which minimizes the number of communication steps and so favors latency. The distributed mode splits the Newton-Schulz computation across devices, which favors computational throughput at the cost of more frequent communication. The blockwise mode needs no communication at all: each device orthogonalizes only its local momentum block. It is the cheapest option, but it is mathematically distinct from orthogonalizing the full matrix.

Further performance gains come from communication hiding, which delays parameter gathering so that it overlaps with the forward computation of the next batch. Load-balancing strategies distribute layers across GPUs according to their computational cost, while SYRK kernels and fused all-reduce operations reduce floating-point operations and bandwidth usage. NVIDIA also plans future enhancements that use CuTe DSL to fuse communication into computation kernels at a finer granularity.

Beyond Muon, NVIDIA supports other research-grade optimizers, including SOAP, to help the community explore the limits of training efficiency. The company provides detailed instructions and code repositories for reproducing these results and for integrating Muon into existing training pipelines. By combining layer-wise distribution, optimized Newton-Schulz iterations, and advanced kernel techniques, NVIDIA has made it practical to run these higher-order methods at massive scale. Developers are encouraged to start experimenting with the Megatron Bridge performance recipes.
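The cost-based layer distribution described above can be pictured as a greedy scheduling problem. The sketch below is a hedged illustration and is not taken from Megatron Core: the cost model in `estimate_cost`, the function `assign_layers`, and the layer names and shapes are all hypothetical. It assigns each whole layer to the currently least-loaded rank, visiting the most expensive layers first.

```python
import heapq


def estimate_cost(shape):
    """Hypothetical cost model: Newton-Schulz work grows roughly with
    min(rows, cols)^2 * max(rows, cols) for one weight matrix."""
    rows, cols = shape
    small, large = min(rows, cols), max(rows, cols)
    return small * small * large


def assign_layers(layer_shapes, num_ranks):
    """Greedy longest-processing-time assignment of whole layers to ranks."""
    heap = [(0, rank) for rank in range(num_ranks)]  # (accumulated cost, rank)
    heapq.heapify(heap)
    assignment = {rank: [] for rank in range(num_ranks)}
    by_cost = sorted(
        layer_shapes.items(),
        key=lambda item: estimate_cost(item[1]),
        reverse=True,
    )
    for name, shape in by_cost:
        load, rank = heapq.heappop(heap)  # least-loaded rank so far
        assignment[rank].append(name)
        heapq.heappush(heap, (load + estimate_cost(shape), rank))
    return assignment


# Illustrative shapes only; they do not come from any model in the article.
layers = {
    "attn.qkv": (3072, 1024),
    "attn.out": (1024, 1024),
    "mlp.up": (4096, 1024),
    "mlp.down": (1024, 4096),
}
plan = assign_layers(layers, num_ranks=2)
loads = {
    rank: sum(estimate_cost(layers[name]) for name in names)
    for rank, names in plan.items()
}
print(plan)
print(loads)
```

A greedy pass like this is not guaranteed to be optimal, but it keeps each layer intact on one rank, which is the property the layer-wise distributed optimizer relies on.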
