BroRL Breaks Reinforcement Learning Plateaus with Rollout Scaling for Superior LLM Reasoning
NVIDIA Research has introduced Broadened Reinforcement Learning (BroRL), a new approach that overcomes performance plateaus when training large language models with reinforcement learning from verifiable rewards (RLVR). Earlier methods such as Prolonged Reinforcement Learning (ProRL) achieved gains by extending the number of training steps, but they eventually hit a ceiling where performance stalled or declined. BroRL addresses this by shifting the scaling axis from training length to the number of exploratory rollouts per prompt, increasing it from 16 to 512 and thereby broadening the model's exploration of the solution space.

This rollout scaling stabilizes the reinforcement learning signal. The core insight lies in the balance between two forces: the feedback from sampled trajectories (actual rollouts) and the uncertainty contributed by unexplored trajectories (the unsampled space). With few rollouts, noise from the unsampled space dominates, pulling performance downward and trapping the model on a plateau. BroRL counteracts this by sampling many more rollouts per prompt, which averages out the noise and amplifies the upward signal from successful trajectories; a toy numerical sketch at the end of this article illustrates the averaging effect. Theoretical analysis confirms that when the rollout count N is sufficiently large, the net learning signal becomes positive, enabling sustained improvement.

Empirical results demonstrate BroRL's advantage. When applied to a ProRLv2 model that had plateaued after 3,000 training steps, BroRL reversed the stagnation and delivered continuous gains. On the Math benchmark, BroRL surpassed ProRL's peak score within 98.1 hours, 35 hours faster than ProRL, and reached a final score of 63.66 compared to ProRL's 62.02. Similar improvements held on the Code and Reasoning Gym benchmarks, where BroRL achieved state-of-the-art results for a 1.5-billion-parameter model.

Beyond raw performance, BroRL is also more compute-efficient. It reaches higher accuracy with fewer output tokens, indicating more efficient reasoning and less redundancy. By uncovering concise, high-quality reasoning paths early, BroRL reduces reliance on verbose, low-signal reasoning chains. On the Math and Code tasks, it produced higher scores while using fewer tokens, up to 745 fewer on Math and 717 fewer on Code, demonstrating better score-per-token efficiency.

BroRL establishes rollout size as a critical scaling dimension, showing that the performance ceilings seen with step-scaling methods are not inherent limits of reinforcement learning but artifacts of limited exploration. The takeaway is simple: when progress stalls, broaden exploration rather than extending training. The BroRL model is now available on Hugging Face for researchers and developers to explore and integrate into their workflows, opening a new path for unlocking deeper reasoning capabilities in language models through smarter, more efficient scaling.
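The averaging effect behind rollout scaling can be shown with a small Monte Carlo sketch. This is only an illustration of the intuition, not the paper's actual analysis; the success probability and threshold below are arbitrary values chosen for the example. It shows that, for a fixed per-rollout success probability, the per-prompt reward signal estimated from N rollouts is far less likely to be dragged below zero by sampling noise at N = 512 than at N = 16.

```python
import numpy as np

rng = np.random.default_rng(0)
p_success = 0.3   # assumed chance that a single rollout solves the prompt (illustrative)
trials = 10_000   # independent repetitions of the estimate

for n_rollouts in (16, 512):
    # Binary verifiable rewards for each trial's batch of rollouts.
    rewards = rng.binomial(1, p_success, size=(trials, n_rollouts))
    # Centered mean reward: a stand-in for the per-prompt update signal.
    signal = rewards.mean(axis=1) - p_success
    # Fraction of trials where noise drags the signal well below zero.
    flipped = np.mean(signal < -0.1)
    print(f"N={n_rollouts:4d}  signal std={signal.std():.4f}  P(signal < -0.1)={flipped:.4f}")
```

At N = 16 the estimate is noisy and a noticeable fraction of batches point sharply the wrong way, while at N = 512 that fraction collapses toward zero. In the article's framing, the small-N regime is where noise from the unsampled space can dominate the update, and scaling the rollout count is what makes the net learning signal reliably positive.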