MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

Abstract
While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting of 1,260 samples across 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for the instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and to conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6% accuracy improvement on the MM-HELIX benchmark and demonstrates strong generalization with a +5.7% average performance gain on general mathematics and logic tasks. Our work demonstrates that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.
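The adaptive idea behind AHPO, leaning on expert data while rewards are sparse and switching to pure online exploration once the model succeeds often enough, can be illustrated with a small gating function. This is only a sketch under stated assumptions, not the paper's formulation: the function name `hybrid_loss`, the `success_threshold` parameter, the "reward > 0 counts as success" rule, and the simple additive combination are all hypothetical choices made for illustration.

```python
from typing import Sequence


def hybrid_loss(
    rollout_rewards: Sequence[float],
    online_loss: float,
    expert_nll: float,
    success_threshold: float = 0.5,  # hypothetical gate value, not from the paper
) -> float:
    """Combine an online RL loss with an offline expert loss behind a gate.

    Assumption: a rollout counts as a success when its reward is positive.
    When the fraction of successful rollouts falls below the threshold,
    the expert (supervised) term is added; otherwise only the online
    term is used, so the model explores on its own.
    """
    if not rollout_rewards:
        raise ValueError("rollout_rewards must not be empty")

    successes = sum(1 for r in rollout_rewards if r > 0)
    success_rate = successes / len(rollout_rewards)

    # Gate: use expert supervision only while rewards are sparse.
    use_expert = success_rate < success_threshold
    return online_loss + (expert_nll if use_expert else 0.0)


# Sparse rewards: the expert term is switched on.
sparse = hybrid_loss([0.0, 0.0, 0.0, 0.0], online_loss=1.0, expert_nll=2.0)

# Mostly successful rollouts: only the online term remains.
proficient = hybrid_loss([1.0, 1.0, 1.0, 0.0], online_loss=1.0, expert_nll=2.0)
```

In a real training loop the two loss terms would be tensors computed per prompt, and the gate would be evaluated per prompt group rather than once globally; the scalar version above only shows the switching behavior.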