
RM-R1: Reward Modeling as Reasoning

Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, Heng Ji
Published: 5/8/2025
Abstract

Reward modeling is essential for aligning large language models (LLMs) with human preferences, especially through reinforcement learning from human feedback (RLHF). To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. However, existing RMs either produce opaque scalar scores or directly generate the prediction of a preferred answer, making them struggle to integrate natural language critiques and thus lacking interpretability. Inspired by recent advances of long chain-of-thought (CoT) on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning capabilities into reward modeling significantly enhances an RM's interpretability and performance. In this work, we introduce a new class of generative reward models, Reasoning Reward Models (ReasRMs), which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. The training consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. RM-R1 improves LLM rollouts by self-generating reasoning traces or chat-specific rubrics and evaluating candidate responses against them. Empirically, our models achieve state-of-the-art or near state-of-the-art performance among generative RMs across multiple comprehensive reward model benchmarks, outperforming much larger open-weight models (e.g., Llama3.1-405B) and proprietary ones (e.g., GPT-4o) by up to 13.8%. Beyond final performance, we perform a thorough empirical analysis to understand the key ingredients of successful ReasRM training. To facilitate future research, we release six ReasRM models along with code and data at https://github.com/RM-R1-UIUC/RM-R1.
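To make the pipeline concrete, below is a minimal Python sketch of the rubric-then-verdict judging flow the abstract describes. It is an illustration, not the released RM-R1 code: the prompt wording, the [[A]]/[[B]] verdict tags, and the `generate` stub are all assumptions standing in for a real model call.

```python
# A minimal sketch (not the authors' released implementation) of the
# ReasRM judging loop described in the abstract: the model first writes
# its own evaluation rubric, reasons against it, and ends with a
# parseable verdict.

JUDGE_TEMPLATE = """You are an impartial judge.
First, write a short rubric of criteria for a good answer to the question.
Then evaluate both candidate answers against your rubric, reasoning step by step.
End with a final verdict on its own line: [[A]] or [[B]].

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}
"""


def generate(prompt: str) -> str:
    """Placeholder for a call to a ReasRM such as RM-R1 (e.g. via Hugging
    Face transformers or an inference server). Hardcoded here so the
    sketch runs without model weights."""
    return (
        "Rubric: factual accuracy, completeness, clarity.\n"
        "Answer A is accurate but incomplete; Answer B satisfies all three criteria.\n"
        "[[B]]"
    )


def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Return 'A' or 'B' by locating the last verdict tag in the output."""
    output = generate(
        JUDGE_TEMPLATE.format(question=question, answer_a=answer_a, answer_b=answer_b)
    )
    return "A" if output.rfind("[[A]]") > output.rfind("[[B]]") else "B"


if __name__ == "__main__":
    print(judge(
        "What causes ocean tides?",
        "The Moon.",
        "Mainly the gravitational pull of the Moon and Sun acting on Earth's oceans.",
    ))  # -> B
```

In training stage (2), the parsed verdict can be compared against the ground-truth preference label to produce a binary reward, which is what makes the reward signal verifiable.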