
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, Yangfan He, Mi Zhang, Shen Yan
Publication date: 6/4/2025
Abstract

Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose Multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.
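The abstract does not spell out the reward formula, so the following is a minimal sketch of how a reflection-aware reward could be combined with the standard GRPO group-relative advantage computation. The specific reward shape, the function names (`reflection_reward`, `grpo_advantages`), and parameters such as `max_tokens`, `bonus`, and the `redundancy` score are hypothetical illustrations, not the paper's actual mechanism.

```python
import numpy as np

def reflection_reward(is_correct: bool, reflection_tokens: int,
                      redundancy: float, max_tokens: int = 256,
                      bonus: float = 0.5) -> float:
    """Hypothetical reward: answer correctness plus a bonus for concise,
    non-redundant reflection (penalizing overlong or repetitive text)."""
    r = 1.0 if is_correct else 0.0
    # Concision term: full bonus for a short reflection, decaying to zero
    # as the reflection approaches max_tokens.
    concision = max(0.0, 1.0 - reflection_tokens / max_tokens)
    # redundancy in [0, 1]: e.g., n-gram overlap between the reflection and
    # the initial response; high overlap means the reflection adds little.
    r += bonus * concision * (1.0 - redundancy)
    return r

def grpo_advantages(rewards: list[float]) -> np.ndarray:
    """Group-relative advantages as in GRPO: normalize each sampled
    response's reward by the mean and std of its sampling group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: a group of 4 sampled responses to one prompt, as
# (is_correct, reflection_tokens, redundancy) triples.
rewards = [reflection_reward(c, n, d) for c, n, d in
           [(True, 80, 0.1), (True, 300, 0.1),
            (False, 60, 0.2), (True, 120, 0.7)]]
print(grpo_advantages(rewards))
```

Because GRPO normalizes rewards within each sampling group, only relative differences matter: in this sketch the reflection bonus acts as a tie-breaker among equally correct responses, nudging the policy toward short, informative reflections rather than verbose or repetitive ones.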