Optimizing Anytime Reasoning via Budget Relative Policy Optimization

Scaling test-time compute is crucial for enhancing the reasoning capabilities of large language models (LLMs). Existing approaches typically employ reinforcement learning (RL) to maximize a verifiable reward obtained at the end of reasoning traces. However, such methods optimize only the final performance under a large and fixed token budget, which hinders efficiency in both training and deployment. In this work, we present a novel framework, AnytimeReasoner, to optimize anytime reasoning performance, which aims to improve token efficiency and the flexibility of reasoning under varying token budget constraints. To achieve this, we truncate the complete thinking process to fit within token budgets sampled from a prior distribution, compelling the model to summarize the optimal answer for each truncated thinking process for verification. This introduces verifiable dense rewards into the reasoning process, facilitating more effective credit assignment in RL optimization. We then optimize the thinking and summary policies in a decoupled manner to maximize the cumulative reward. Additionally, we introduce a novel variance reduction technique, Budget Relative Policy Optimization (BRPO), to enhance the robustness and efficiency of the learning process when reinforcing the thinking policy. Empirical results on mathematical reasoning tasks demonstrate that our method consistently outperforms GRPO across all thinking budgets under various prior distributions, enhancing both training and token efficiency.
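
To make the setup concrete, the sketch below shows one way to compute per-budget verifiable rewards from a truncated thinking trace and the resulting anytime objective. The helpers `summarize` and `verify`, and the budget-wise group baseline, are illustrative placeholders under stated assumptions, not the paper's exact implementation of BRPO.

```python
from statistics import mean
from typing import Callable, List

def per_budget_rewards(
    thinking_tokens: List[str],               # one complete thinking trace (as a token list)
    budgets: List[int],                       # token budgets sampled from a prior distribution
    summarize: Callable[[List[str]], str],    # hypothetical summary policy: truncated thinking -> answer
    verify: Callable[[str], float],           # hypothetical verifier: answer -> reward in {0.0, 1.0}
) -> List[float]:
    """Dense, verifiable rewards: one reward per sampled budget (sorted ascending)."""
    rewards = []
    for b in sorted(budgets):
        truncated = thinking_tokens[:b]       # truncate the thinking process to this budget
        answer = summarize(truncated)         # summary policy proposes an answer from the partial thinking
        rewards.append(verify(answer))        # verifiable reward at this budget
    return rewards

def anytime_objective(rewards: List[float]) -> float:
    """Anytime performance = expected reward under the budget prior (Monte Carlo mean here)."""
    return mean(rewards)

def budget_relative_advantages(group_rewards: List[List[float]]) -> List[List[float]]:
    """
    Simplified budget-wise group baseline (illustration only, not the exact BRPO estimator):
    for each budget index, subtract the group-mean reward at that budget from each rollout's reward.
    `group_rewards[i][j]` is rollout i's reward at the j-th sampled budget.
    """
    num_budgets = len(group_rewards[0])
    baselines = [mean(r[j] for r in group_rewards) for j in range(num_budgets)]
    return [[r[j] - baselines[j] for j in range(num_budgets)] for r in group_rewards]
```

Evaluating the summary at every sampled budget turns a single terminal reward into a vector of verifiable rewards, which is what permits denser credit assignment along the thinking trace than a single end-of-trace signal.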