SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, Yangfan He, Mi Zhang, Shen Yan
Published: 6/4/2025
Abstract

Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose Multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.
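To make the second-stage idea concrete, below is a minimal Python sketch of what a reflection-aware reward combined with group-relative advantages could look like. This is an illustration under stated assumptions, not the paper's released implementation: the `<reflect>...</reflect>` delimiters, the redundancy/length penalties, and the weighting constant `w_reflect` are all hypothetical choices made here for clarity.

```python
# Illustrative sketch only; SRPO's actual reward design may differ.
from dataclasses import dataclass
from typing import List
import re

@dataclass
class Rollout:
    response: str     # full model output, possibly containing a reflection segment
    is_correct: bool  # whether the final answer matches the ground truth

def reflection_reward(resp: str, max_reflection_tokens: int = 128) -> float:
    """Toy reward favoring a concise, non-redundant reflection segment."""
    # Assumption: reflections are delimited by <reflect> ... </reflect> tags.
    m = re.search(r"<reflect>(.*?)</reflect>", resp, flags=re.S)
    if m is None:
        return 0.0
    tokens = m.group(1).split()
    if not tokens:
        return 0.0
    redundancy = 1.0 - len(set(tokens)) / len(tokens)        # crude repetition measure
    length_penalty = max(0.0, len(tokens) - max_reflection_tokens) / max_reflection_tokens
    return max(0.0, 1.0 - redundancy - length_penalty)

def grpo_advantages(rollouts: List[Rollout], w_reflect: float = 0.2) -> List[float]:
    """Group-relative advantages: per-rollout reward normalized against the group."""
    rewards = [float(r.is_correct) + w_reflect * reflection_reward(r.response)
               for r in rollouts]
    mean = sum(rewards) / len(rewards)
    std = (sum((x - mean) ** 2 for x in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(x - mean) / std for x in rewards]
```

In this sketch, correctness dominates the reward while the reflection term adds a small bonus only when the reflection is short and non-repetitive, mirroring the abstract's goal of encouraging concise, cognitively meaningful reflection without rewarding redundancy.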