
Abstract
We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve frontier-level performance. Beyond current long CoT, the model demonstrates advanced cognitive behaviors, such as thinking carefully before using Python coding tools and reflecting on code execution feedback to autonomously explore, verify, and refine intermediate steps in complex problem-solving. This capability is enabled through three key innovations that make agentic RL effective at scale: (i) an efficient RL infrastructure with a reliable Python code environment that supports high-throughput execution and mitigates the high rollout costs, enabling training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an agentic RL algorithm with a Resample-on-Correct rollout strategy that addresses the inherent environment noise from coding tools, allowing the model to reason more effectively in a code environment; (iii) an efficient agent training recipe that starts with non-reasoning SFT and progresses through multiple RL stages, yielding advanced cognitive abilities with minimal compute cost. As a result, rStar2-Agent boosts a pre-trained 14B model to state-of-the-art performance in only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with significantly shorter responses. Beyond mathematics, rStar2-Agent-14B also demonstrates strong generalization to alignment, scientific reasoning, and agentic tool-use tasks. Code and training recipes are available at https://github.com/microsoft/rStar.
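The abstract names the GRPO-RoC Resample-on-Correct rollout strategy without spelling out its mechanics, so the following is a minimal, hypothetical Python sketch of one way such a rollout filter could work: oversample rollouts per prompt, keep incorrect ones, and prefer correct ones with the fewest tool-execution errors. The `Rollout` fields, the `resample_on_correct` helper, and the half-and-half split are illustrative assumptions, not the paper's implementation; see the linked repository for the actual algorithm.

```python
import random
from dataclasses import dataclass, field
from typing import List

@dataclass
class Rollout:
    """One sampled trajectory for a prompt (fields here are illustrative)."""
    answer_correct: bool     # final answer matched the reference
    tool_error_count: int    # Python tool calls that raised errors or returned noise
    tokens: List[int] = field(default_factory=list)

def resample_on_correct(rollouts: List[Rollout], group_size: int, seed: int = 0) -> List[Rollout]:
    """Downselect an oversampled pool of rollouts into a GRPO group.

    Incorrect rollouts are kept as sampled (they carry the negative signal),
    while correct rollouts are ranked by how cleanly they used the code tool,
    so trajectories that succeeded despite noisy execution feedback are less
    likely to be reinforced.
    """
    rng = random.Random(seed)
    correct = sorted((r for r in rollouts if r.answer_correct),
                     key=lambda r: r.tool_error_count)
    incorrect = [r for r in rollouts if not r.answer_correct]

    # Reserve roughly half the group for incorrect rollouts, then fill the
    # remainder with the cleanest correct ones (the split ratio is an assumption).
    keep_incorrect = incorrect[: group_size // 2]
    keep_correct = correct[: group_size - len(keep_incorrect)]
    group = keep_incorrect + keep_correct
    rng.shuffle(group)
    return group
```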