
Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning

Junhao Shen, Haiteng Zhao, Yuzhe Gu, Songyang Gao, Kuikun Liu, Haian Huang, Jianfei Gao, Dahua Lin, Wenwei Zhang, Kai Chen
Abstract

Enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning is crucial for solving complex multimodal tasks. However, since LVLMs are mainly trained with vision-language alignment, it is difficult to adopt on-policy reinforcement learning (RL) to develop the slow-thinking ability, because the rollout space is restricted by the model's initial abilities. Off-policy RL offers a way to go beyond the current policy, but directly distilling trajectories from external models may cause visual hallucinations due to mismatched visual perception abilities across models. To address these issues, this paper proposes SOPHIA, a simple and scalable Semi-Off-Policy RL for vision-language slow-tHInking reAsoning. SOPHIA builds a semi-off-policy behavior model by combining on-policy visual understanding from a trainable LVLM with off-policy slow-thinking reasoning from a language model, assigns outcome-based rewards to reasoning, and propagates visual rewards backward. The LVLM then learns the slow-thinking reasoning ability from the obtained reasoning trajectories using the propagated rewards via off-policy RL algorithms. Extensive experiments with InternVL2.5 and InternVL3.0 at 8B and 38B sizes show the effectiveness of SOPHIA. Notably, SOPHIA improves InternVL3.0-38B by 8.50% on average, reaching state-of-the-art performance among open-source LVLMs on multiple multimodal reasoning benchmarks, and even outperforms some closed-source models (e.g., GPT-4.1) on the challenging MathVision and OlympiadBench, achieving 49.08% and 49.95% pass@1 accuracy, respectively. Analysis shows SOPHIA outperforms supervised fine-tuning and direct on-policy RL methods, offering a better policy initialization for further on-policy training.
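To make the rollout scheme concrete, below is a minimal Python sketch of the semi-off-policy data collection described in the abstract: on-policy visual understanding from the trainable LVLM, off-policy slow-thinking reasoning from an external language model, outcome-based rewards, and a visual reward propagated backward to the description. The interfaces (`lvlm_describe`, `reasoner_solve`, `outcome_reward`) and the propagation rule (averaging outcome rewards over reasoning chains that share a description) are illustrative assumptions, not SOPHIA's exact formulation.

```python
# Illustrative sketch of a SOPHIA-style semi-off-policy rollout (assumed interfaces).
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class Trajectory:
    visual_description: str   # on-policy output of the trainable LVLM
    reasoning: str            # off-policy output of the external language model
    reasoning_reward: float   # outcome-based reward on the final answer
    visual_reward: float      # reward propagated back to the visual description


def collect_trajectories(
    image: str,
    question: str,
    gold_answer: str,
    lvlm_describe: Callable[[str, str], str],     # (image, question) -> visual description
    reasoner_solve: Callable[[str, str], str],    # (description, question) -> reasoning chain
    outcome_reward: Callable[[str, str], float],  # (reasoning, gold answer) -> scalar reward
    n_reasoning_samples: int = 4,
) -> List[Trajectory]:
    """Build semi-off-policy rollouts: on-policy perception + off-policy reasoning."""
    description = lvlm_describe(image, question)                  # on-policy visual understanding
    reasonings = [reasoner_solve(description, question)
                  for _ in range(n_reasoning_samples)]            # off-policy slow thinking
    rewards = [outcome_reward(r, gold_answer) for r in reasonings]
    visual_reward = mean(rewards)                                 # backward-propagated visual reward (assumed rule)
    return [
        Trajectory(description, r, rr, visual_reward)
        for r, rr in zip(reasonings, rewards)
    ]


if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end without real models.
    trajs = collect_trajectories(
        image="img_001.png",
        question="What is the area of the shaded region?",
        gold_answer="12",
        lvlm_describe=lambda img, q: f"A diagram relevant to: {q}",
        reasoner_solve=lambda desc, q: "Step-by-step reasoning ... final answer: 12",
        outcome_reward=lambda reasoning, gold: float(gold in reasoning),
        n_reasoning_samples=2,
    )
    for t in trajs:
        print(t.reasoning_reward, t.visual_reward)
```

The collected trajectories and their propagated rewards would then feed an off-policy RL update of the LVLM (e.g., a reward-weighted likelihood objective), which is the training stage the abstract refers to; the specific algorithm is detailed in the paper rather than in this sketch.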