
NVIDIA NeMo-RL v0.3 Unveils Megatron-Core Support for High-Performance Reinforcement Learning Training on Large Models

NVIDIA has introduced enhanced reinforcement learning capabilities in NeMo-RL v0.3, which adds full support for the Megatron-Core backend and delivers substantial performance improvements for training large language models.

The initial release of NeMo-RL relied on PyTorch DTensor (FSDP2) for distributed training, an approach that runs into limitations at scale. As model sizes grow into the hundreds of billions of parameters, activation memory and recompute overhead significantly slow down each training step. DTensor also lacks access to NVIDIA's optimized CUDA kernels and other advanced performance features, making it less efficient for high-throughput training.

To address these challenges, NeMo-RL now integrates Megatron-Core, a GPU-optimized library designed for massive model training. Megatron-Core applies a 6D parallelism strategy that combines tensor, pipeline, data, sequence, context, and expert parallelism to balance computation and communication efficiently. This enables faster training with better resource utilization, especially for dense models and Mixture of Experts (MoE) architectures.

NeMo-RL hides much of Megatron-Core's complexity by abstracting low-level configuration details. Users enable Megatron-based training by adding a policy.megatron_cfg section to their YAML configuration file and setting enabled: true; every parameter in that section is passed through to Megatron automatically, so no deep expertise in parallelism settings is required (a configuration sketch appears below).

Performance benchmarks show clear advantages of Megatron-Core over DTensor. Training the Llama 3.1-8B Instruct model with Megatron reduces total step time by 20% compared to DTensor, while the Llama 3.1-70B Base model sees a 30% improvement. Similar gains are observed for Qwen3 models, with significantly faster step times and stable convergence. These results are achieved with identical training objectives and hyperparameters, confirming that the performance gains do not come at the cost of model quality.

Two further optimizations, sequence packing and importance sampling, add to these gains (both are sketched below). Sequence packing reduces padding by combining multiple sequences into longer ones, cutting step time by up to 50% for models like Llama 70B without affecting convergence. Importance sampling corrects for the mismatch between the inference and training distributions, reducing variance and improving training stability across backends.

Megatron-Core also supports long-context training, enabling models to handle sequences of up to 16,384 tokens. The Llama 3.3-70B Instruct model trains well at this length, with total step times under 45 seconds on 16 nodes of 8 GPUs each, demonstrating scalability for real-world workloads.

NeMo-RL v0.3 also ships other improvements, including better support for MoE models, enhanced logging, and streamlined experiment tracking. Future updates are slated to bring advanced curriculum learning, improved checkpointing, and expanded model zoo integration.

In summary, NeMo-RL v0.3 with the Megatron-Core backend delivers a powerful, scalable solution for efficient reinforcement-learning post-training of large models. With optimized performance, support for long sequences, and user-friendly configuration, it lets researchers and engineers train state-of-the-art models faster and more reliably. Developers are encouraged to explore the official documentation, example scripts, and configuration files to start using these advancements.
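To make the configuration step described above concrete, here is a minimal sketch of the YAML change. Only the policy.megatron_cfg section and its enabled: true flag come from the article itself; the parallelism keys shown are illustrative assumptions in Megatron-style naming, so consult the example configs shipped with NeMo-RL for the exact schema.

```yaml
# Minimal sketch: switch the NeMo-RL training backend to Megatron-Core.
policy:
  megatron_cfg:
    enabled: true  # per the article, this flag activates the Megatron backend
    # The keys below are illustrative assumptions (Megatron-style names),
    # not confirmed by the article; everything in this section is passed
    # through to Megatron-Core automatically.
    tensor_model_parallel_size: 4
    pipeline_model_parallel_size: 2
    sequence_parallel: true
```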
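Sequence packing is likewise a training-configuration switch. The sketch below is hypothetical: the sequence_packing section name and its keys are assumptions chosen to illustrate the idea of bin-packing several short sequences into one long one to eliminate padding.

```yaml
# Hypothetical sketch of enabling sequence packing; key names are assumptions.
policy:
  sequence_packing:
    enabled: true  # pack several short sequences into one long sequence, cutting padding
    algorithm: modified_first_fit_decreasing  # assumed name for a bin-packing strategy
```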
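Conceptually, the importance sampling correction weights each generated token by the ratio of its probability under the training backend to its probability under the inference engine that produced it. In our notation (not the article's), a common per-token form is w_t = pi_train(y_t | y_<t) / pi_infer(y_t | y_<t), with that token's loss term scaled by w_t. This compensates for small numerical disagreements between the two stacks, which is what reduces variance and stabilizes training across backends.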
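For long-context runs, the 16,384-token setting mentioned above would typically be expressed as a sequence-length field in the same YAML file. The value comes from the article; the key name below is an assumption for illustration.

```yaml
# The 16,384-token value is from the article; the key name is assumed.
policy:
  max_total_sequence_length: 16384
```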
