Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang
Publication date: 5/13/2025
Abstract

Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities and aiming to achieve comprehensive perception, precise understanding, and deep reasoning. As research advances, multimodal reasoning has rapidly evolved from modular, perception-driven pipelines to unified, language-centric frameworks that offer more coherent cross-modal understanding. While instruction tuning and reinforcement learning have improved model reasoning, significant challenges remain in omni-modal generalization, reasoning depth, and agentic behavior. To address these issues, we present a comprehensive and structured survey of multimodal reasoning research, organized around a four-stage developmental roadmap that reflects the field's shifting design philosophies and emerging capabilities. First, we review early efforts based on task-specific modules, where reasoning was implicitly embedded across stages of representation, alignment, and fusion. Next, we examine recent approaches that unify reasoning into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT) and multimodal reinforcement learning enabling richer and more structured reasoning chains. Finally, drawing on empirical insights from challenging benchmarks and experimental cases of OpenAI O3 and O4-mini, we discuss the conceptual direction of native large multimodal reasoning models (N-LMRMs), which aim to support scalable, agentic, and adaptive reasoning and planning in complex, real-world environments.