
RM-R1: Reward Modeling as Reasoning

Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, Heng Ji
Publication date: 5/8/2025
Abstract

Reward modeling is essential for aligning large language models (LLMs) with human preferences, especially through reinforcement learning from human feedback (RLHF). To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. However, existing RMs either produce opaque scalar scores or directly generate the prediction of a preferred answer, which prevents them from integrating natural language critiques and leaves them lacking interpretability. Inspired by recent advances of long chain-of-thought (CoT) on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning capabilities into reward modeling significantly enhances an RM's interpretability and performance. In this work, we introduce a new class of generative reward models -- Reasoning Reward Models (ReasRMs) -- which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. The training consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. RM-R1 improves LLM rollouts by self-generating reasoning traces or chat-specific rubrics and evaluating candidate responses against them. Empirically, our models achieve state-of-the-art or near state-of-the-art performance among generative RMs across multiple comprehensive reward model benchmarks, outperforming much larger open-weight models (e.g., Llama3.1-405B) and proprietary ones (e.g., GPT-4o) by up to 13.8%. Beyond final performance, we perform a thorough empirical analysis to understand the key ingredients of successful ReasRM training. To facilitate future research, we release six ReasRM models along with code and data at https://github.com/RM-R1-UIUC/RM-R1.
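
To ground the description above, here is a minimal Python sketch of the ReasRM judging loop and the stage-2 verifiable reward. Everything in it is an illustrative assumption layered on the abstract, not the authors' implementation: the prompt template, the `generate` callable standing in for an LLM call, and the `<answer>A</answer>`/`<answer>B</answer>` tag convention are all hypothetical.

```python
# Minimal sketch of ReasRM-style judging with a verifiable reward.
# Assumptions (not from the RM-R1 code): a generate(prompt) -> str LLM
# call, and a convention that the judge ends its trace with an
# <answer>A</answer> or <answer>B</answer> tag.
import re

JUDGE_TEMPLATE = """You are a reward model. First write an evaluation
rubric for the user's question, then reason step by step about which
candidate answer better satisfies it. Finish with <answer>A</answer>
or <answer>B</answer>.

Question: {question}

Candidate A: {answer_a}

Candidate B: {answer_b}
"""

def judge(generate, question: str, answer_a: str, answer_b: str):
    """Run the reasoning reward model once; return 'A', 'B', or None.

    The full trace (self-generated rubric plus comparison) stays
    available for inspection, unlike an opaque scalar score.
    """
    trace = generate(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b))
    match = re.search(r"<answer>\s*([AB])\s*</answer>", trace)
    return match.group(1) if match else None

def verifiable_reward(prediction, gold: str) -> float:
    """Stage-2 RL signal: +1 if the verdict matches the labeled
    preference, -1 otherwise (a common rule-based choice)."""
    return 1.0 if prediction == gold else -1.0

# Example with a stubbed model call:
fake_generate = lambda p: "Rubric: correctness, clarity. <answer>B</answer>"
assert judge(fake_generate, "2+2?", "5", "4") == "B"
assert verifiable_reward("B", gold="B") == 1.0
```

The point this sketch tries to make concrete is the one the abstract emphasizes: the reward is computed on the model's own verbalized judgment, so the reasoning that produced it remains interpretable rather than being compressed into a scalar.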