
RLPR: Extrapolating RLVR to General Domains without Verifiers

Tianyu Yu, Bo Ji, Shouli Wang, Shu Yao, Zefan Wang, Ganqu Cui, Lifan Yuan, Ning Ding, Yuan Yao, Zhiyuan Liu, Maosong Sun, Tat-Seng Chua
Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates promising potential in advancing the reasoning capabilities of LLMs. However, its success remains largely confined to mathematical and code domains. This primary limitation stems from the heavy reliance on domain-specific verifiers, which results in prohibitive complexity and limited scalability. To address this challenge, our key observation is that an LLM's intrinsic probability of generating a correct free-form answer directly indicates its own evaluation of the reasoning reward (i.e., how well the reasoning process leads to the correct answer). Building on this insight, we propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. RLPR uses the LLM's own token probability scores for reference answers as the reward signal and maximizes the expected reward during training. We find that addressing the high variance of this noisy probability reward is crucial to making it work, and propose prob-to-reward and stabilizing methods to ensure a precise and stable reward from the LLM's intrinsic probabilities. Comprehensive experiments on four general-domain benchmarks and three mathematical benchmarks show that RLPR consistently improves reasoning capabilities in both areas for Gemma-, Llama-, and Qwen-based models. Notably, RLPR outperforms the concurrent VeriFree by 7.6 points on TheoremQA and 7.5 points on Minerva, and even surpasses the strong verifier-model-dependent approach General-Reasoner by 1.6 average points across seven benchmarks.
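To make the reward construction concrete, below is a minimal sketch, under stated assumptions, of how a verifier-free reward might be derived from the model's own token probabilities for a reference answer. The function name `probability_reward` and the mean-per-token-probability reduction are illustrative choices, not the paper's exact formulation; RLPR's prob-to-reward transformation and variance-stabilization steps are omitted here.

```python
import torch
import torch.nn.functional as F


def probability_reward(logits: torch.Tensor, reference_ids: torch.Tensor) -> torch.Tensor:
    """Verifier-free reward sketch from the model's own probabilities.

    Assumptions (illustrative, not the paper's exact method):
      - `logits` has shape (answer_len, vocab_size): the model's next-token
        logits at each reference-answer position, conditioned on the question,
        the sampled reasoning chain, and the preceding reference tokens.
      - `reference_ids` has shape (answer_len,): token ids of the reference answer.
    """
    log_probs = F.log_softmax(logits, dim=-1)                        # (L, V)
    token_log_probs = log_probs.gather(-1, reference_ids[:, None])   # (L, 1)
    # Reduce to a scalar in [0, 1]: mean per-token probability of the reference answer.
    return token_log_probs.exp().mean()
```

A reward like this is noisy, which is why the abstract emphasizes prob-to-reward shaping and stabilization; in practice one would also debias it, e.g. against a baseline reward computed without the reasoning chain, before using it as the RL training signal.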