
Abstract
The rapid advancement of Multimodal Large Language Models (MLLMs) has made aligning them with human preferences a critical challenge. Reward Models (RMs) are a core technology for achieving this goal, but a systematic guide for building state-of-the-art Multimodal Reward Models (MRMs) is currently lacking in both academia and industry. Through exhaustive experimental analysis, this paper aims to provide a clear "recipe" for constructing high-performance MRMs. We systematically investigate every crucial component in the MRM development pipeline, including reward modeling paradigms (e.g., Naive-RM, Critic-based RM, and Generative RM), reward head architecture, training strategies, data curation (covering over ten multimodal and text-only preference datasets), backbone model and model scale, and ensemble methods. Based on these experimental insights, we introduce BaseReward, a powerful and efficient baseline for multimodal reward modeling. BaseReward adopts a simple yet effective architecture, built upon a Qwen2.5-VL backbone, featuring an optimized two-layer reward head, and is trained on a carefully curated mixture of high-quality multimodal and text-only preference data. Our results show that BaseReward establishes a new SOTA on major benchmarks such as MM-RLHF-Reward Bench, VL-Reward Bench, and Multimodal Reward Bench, outperforming previous models. Furthermore, to validate its practical utility beyond static benchmarks, we integrate BaseReward into a real-world reinforcement learning pipeline, successfully enhancing an MLLM's performance across various perception, reasoning, and conversational tasks. This work not only delivers a top-tier MRM but, more importantly, provides the community with a clear, empirically-backed guide for developing robust reward models for the next generation of MLLMs.
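To make the described architecture concrete, the sketch below shows one plausible reading of a Naive-RM-style multimodal reward model: a Qwen2.5-VL backbone with a two-layer scalar reward head, trained with a pairwise (Bradley-Terry) preference loss. This is an illustrative reconstruction based only on the abstract; the checkpoint name, pooling strategy, head widths, activation, and loss details are assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the paper's code) of a Naive-RM-style multimodal
# reward model: Qwen2.5-VL backbone + two-layer reward head, trained with
# a pairwise Bradley-Terry preference loss. Head sizes, pooling, and the
# specific checkpoint are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Qwen2_5_VLForConditionalGeneration


class MultimodalRewardModel(nn.Module):
    def __init__(self, backbone_name: str = "Qwen/Qwen2.5-VL-7B-Instruct"):
        super().__init__()
        # Vision-language backbone; the reward head reads its last hidden states.
        self.backbone = Qwen2_5_VLForConditionalGeneration.from_pretrained(
            backbone_name, torch_dtype=torch.bfloat16
        )
        hidden = self.backbone.config.hidden_size
        # Two-layer reward head mapping a pooled hidden state to a scalar score.
        self.reward_head = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, input_ids, attention_mask, **vision_inputs):
        out = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
            **vision_inputs,
        )
        last_hidden = out.hidden_states[-1]            # (B, T, H)
        # Pool the hidden state of each sequence's final non-padding token.
        last_idx = attention_mask.sum(dim=1) - 1       # (B,)
        pooled = last_hidden[torch.arange(last_hidden.size(0)), last_idx]
        return self.reward_head(pooled.float()).squeeze(-1)  # (B,) scalar rewards


def pairwise_preference_loss(reward_chosen: torch.Tensor,
                             reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: push the chosen response's reward above the rejected one's.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```

In this pairwise setup, each preference example contributes two forward passes (chosen and rejected responses over the same multimodal prompt), and the loss only constrains the margin between their scalar rewards; how BaseReward actually pools hidden states and weights its data mixture is determined by the experiments in the paper itself.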