
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
Publication date: 5/11/2025
Abstract

Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long chains of thought (CoT) into the reward reasoning process can significantly strengthen their reliability and robustness. Furthermore, we believe that once RMs internalize CoT reasoning, their direct response accuracy can also be improved through implicit reasoning capabilities. To this end, this paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. Specifically, we adopt an exploration-driven reinforcement fine-tuning approach to elicit and incentivize the model's latent complex reasoning ability: (1) We first use a small amount of image generation preference data to distill the reasoning process of GPT-4o, which is then used for the model's cold start to learn the format and structure of CoT reasoning. (2) Subsequently, by leveraging the model's prior knowledge and generalization capabilities, we prepare large-scale unified multimodal preference data to elicit the model's reasoning process across various vision tasks. During this phase, correct reasoning outputs are retained for rejection sampling to refine the model, (3) while incorrectly predicted samples are finally used for Group Relative Policy Optimization (GRPO)-based reinforcement fine-tuning, enabling the model to explore diverse reasoning paths and optimize for correct and robust solutions. Extensive experiments across various vision reward tasks demonstrate the superiority of our model.
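
As a rough illustration of the group-relative advantage step that GRPO-based reinforcement fine-tuning relies on, the sketch below normalizes each sampled CoT rollout's reward against its group's statistics. This is a minimal sketch under stated assumptions: the scalar 0/1 reward, the function name, and the example values are illustrative, not the paper's actual implementation.

```python
# Minimal sketch of GRPO's group-relative advantage computation.
# Assumption: each sampled CoT rollout receives a scalar reward
# (e.g., 1.0 if its preference verdict is correct, else 0.0).
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical example: 4 CoT rollouts sampled for one preference query.
rewards = [1.0, 0.0, 1.0, 1.0]
print(group_relative_advantages(rewards))
# Rollouts that reached the correct verdict get positive advantages and are
# reinforced; the incorrect one gets a negative advantage and is discouraged.
```

Normalizing within the sampled group, rather than against a learned value baseline, is what lets this style of fine-tuning reward diverse reasoning paths that reach the correct verdict without training a separate critic.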