23 days ago

MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

Abstract

While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting of 1,260 samples across 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for the instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6% accuracy improvement on the MM-HELIX benchmark and demonstrates strong generalization with a +5.7% average performance gain on general mathematics and logic tasks. Our work demonstrates that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.
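
The abstract describes AHPO only at a high level: expert (offline) supervision is used while rewards are sparse, and the model explores on its own once it is proficient. The sketch below illustrates one way such a switch could be expressed. It is not the authors' formulation: the gate rule, the names `expert_gate` and `hybrid_loss`, the `min_successes` threshold, and the additive combination of the two loss terms are all assumptions made for illustration.

```python
from typing import Sequence


def expert_gate(rewards: Sequence[float],
                success_reward: float = 1.0,
                min_successes: int = 1) -> float:
    """Hypothetical gate: 1.0 when rollouts rarely succeed, else 0.0.

    `rewards` holds one scalar reward per sampled rollout for a prompt.
    A rollout counts as a success if its reward reaches `success_reward`.
    """
    successes = sum(1 for r in rewards if r >= success_reward)
    return 1.0 if successes < min_successes else 0.0


def hybrid_loss(rl_loss: float,
                sft_loss: float,
                rewards: Sequence[float],
                min_successes: int = 1) -> float:
    """Assumed combination: always apply the online (RL) loss and add the
    offline (expert imitation) loss only while the gate is open."""
    gate = expert_gate(rewards, min_successes=min_successes)
    return rl_loss + gate * sft_loss


if __name__ == "__main__":
    # All rollouts fail: the expert term is switched on.
    print(hybrid_loss(rl_loss=0.5, sft_loss=2.0, rewards=[0.0, 0.0, 0.0, 0.0]))
    # One rollout succeeds: the model relies on its own exploration.
    print(hybrid_loss(rl_loss=0.5, sft_loss=2.0, rewards=[0.0, 1.0, 0.0, 0.0]))
```

In an actual training loop the two loss values would be differentiable tensors, and the gate would presumably be evaluated per prompt; the plain floats here only show the control flow.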
