4 months ago

Xiangyu Zhao Junming Lin Tianhao Liang Yifan Zhou Wenhao Chai Yuzhe Gu Weiyun Wang Kai Chen Gen Luo Wenwei Zhang

Abstract

While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting of 1,260 samples across 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for the instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and to conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6% accuracy improvement on the MM-HELIX benchmark and demonstrates strong generalization with a +5.7% average performance gain on general mathematics and logic tasks. Our work demonstrates that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.
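The abstract describes AHPO only at a high level: expert supervision is used while rewards are sparse, and the model explores on its own once it is proficient. The sketch below illustrates one way such a gate could look. It is not the authors' implementation. The function name `hybrid_loss`, the `success_threshold` parameter, its default value, and the simple additive combination of the two losses are all assumptions made for illustration.

```python
from typing import Sequence


def hybrid_loss(
    rewards: Sequence[float],
    rl_loss: float,
    sft_loss: float,
    success_threshold: float = 0.5,  # hypothetical value, not from the paper
) -> float:
    """Combine an online RL loss with an offline expert (SFT) loss.

    Illustrative assumption: the expert term is switched on only when the
    fraction of successful rollouts for a prompt falls below a threshold,
    i.e. when the reward signal is sparse.
    """
    if not rewards:
        raise ValueError("at least one rollout reward is required")

    # Fraction of sampled rollouts that earned a positive reward.
    success_rate = sum(1 for r in rewards if r > 0) / len(rewards)

    # Sparse rewards: lean on expert data. Otherwise: pure online optimization.
    use_expert = success_rate < success_threshold
    return rl_loss + (sft_loss if use_expert else 0.0)


# One of four rollouts succeeded, so the expert term is added.
sparse = hybrid_loss([0.0, 0.0, 0.0, 1.0], rl_loss=0.25, sft_loss=1.0)

# Three of four rollouts succeeded, so only the RL term remains.
proficient = hybrid_loss([1.0, 1.0, 1.0, 0.0], rl_loss=0.25, sft_loss=1.0)
```

In this sketch the gate is evaluated per prompt from that prompt's own rollouts, so easy and hard prompts in the same batch could receive different treatment. Whether the paper gates per prompt, per batch, or with a soft weighting is not stated in the abstract.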

MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization
