Runzhe Zhan Yafu Li Zhi Wang Xiaoye Qu Dongrui Liu Jing Shao Derek F. Wong Yu Cheng

Abstract
Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping the learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable, and we identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.
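To make the abstract's core mechanism concrete, the sketch below illustrates one way to score rollouts by correctness and entropy, keep the most valuable ones in a replay buffer, and mix them with fresh on-policy rollouts. This is a minimal illustrative sketch under assumed interfaces, not ExGRPO's actual implementation: the class name ExperienceBuffer, the scoring rule in value(), and the replay_ratio parameter are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Experience:
    priority: float
    rollout: dict = field(compare=False)  # e.g. prompt, response, correct, entropy


class ExperienceBuffer:
    """Keeps the highest-value rollouts for later reuse (illustrative only)."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.heap: list[Experience] = []

    @staticmethod
    def value(correct: bool, entropy: float) -> float:
        # Hypothetical scoring: a verified-correct, low-entropy rollout
        # scores highest, mirroring the abstract's claim that correctness
        # and entropy indicate experience value.
        return (1.0 if correct else 0.0) - entropy

    def add(self, rollout: dict) -> None:
        exp = Experience(self.value(rollout["correct"], rollout["entropy"]), rollout)
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, exp)
        else:
            # Push the new item and drop the current lowest-value one.
            heapq.heappushpop(self.heap, exp)

    def sample(self, k: int) -> list[dict]:
        # Prioritized replay: return the k most valuable stored rollouts.
        return [e.rollout for e in heapq.nlargest(k, self.heap)]


def build_minibatch(buffer: ExperienceBuffer, fresh_rollouts: list[dict],
                    replay_ratio: float = 0.5) -> list[dict]:
    """Mixed-policy batch: fresh on-policy rollouts plus replayed experiences."""
    n_replay = int(len(fresh_rollouts) * replay_ratio)
    return fresh_rollouts + buffer.sample(n_replay)
```

In this toy scheme, the replay_ratio knob stands in for the balance between exploration (fresh rollouts) and experience exploitation (replayed rollouts) that the abstract attributes to the mixed-policy objective.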