3달 전

Jiachun Li Shaoping Huang Zhuoran Jin Chenlong Zhang Pengfei Cao Yubo Chen Kang Liu Jun Zhao

초록

최근 다중모달 대규모 언어 모델(Multimodal Large Language Models, MLLMs)의 추론 능력이 크게 향상되면서 과학적 분석 및 수학적 추론과 같은 보다 복잡한 과제를 해결할 수 있게 되었다. 그럼에도 불구하고, MLLMs의 실제 생활 상황에서 다양한 시나리오에 걸쳐 나타나는 추론 능력은 여전히 거의 탐색되지 않았으며, 평가를 위한 표준화된 벤치마크도 부족한 실정이다. 이러한 격차를 보완하기 위해, 우리는 실제 생활 상황에서 MLLMs의 다양한 다중모달 다중이미지 추론 능력을 평가할 수 있도록 설계된 종합적 벤치마크인 MMR-Life를 제안한다. MMR-Life는 19,108개의 이미지에서 주로 유래한 2,646개의 다지선다형 질문을 포함하며, 추론 유형 7종—추론적(아브덕티브), 유사성 기반(애널로지컬), 인과적, 연역적, 귀납적, 공간적, 시간적—을 포괄적으로 다룬다. 기존의 추론 벤치마크들과 달리, MMR-Life는 특정 도메인 전문 지식에 의존하지 않고, 모델이 다수의 이미지 간 정보를 통합하고 다양한 추론 능력을 적용하도록 요구한다. 37개의 고성능 모델에 대한 평가 결과는 MMR-Life가 여전히 매우 도전적인 평가 기준임을 보여준다. 최고 수준의 모델인 GPT-5조차도 정확도가 58%에 그치며, 추론 유형에 따라 성능 차이가 상당히 크다는 점이 확인되었다. 또한, 기존 MLLMs의 추론 패러다임을 분석하여 사고 길이, 추론 방법, 추론 유형 등의 요인이 모델 성능에 미치는 영향을 탐구하였다. 종합적으로 MMR-Life는 다음 세대 다중모달 추론 시스템의 평가, 분석 및 개선을 위한 포괄적인 기반을 마련하였다.

One-sentence Summary

Researchers from UCAS and CAS introduce MMR-Life, a real-world multimodal benchmark with 2,646 questions across seven reasoning types, challenging 37 models including GPT-5; it reveals critical gaps in multi-image reasoning and guides future MLLM development beyond domain-specific tasks.

Key Contributions

We introduce MMR-Life, the first comprehensive benchmark for evaluating multimodal multi-image reasoning in real-life scenarios, featuring 2,646 questions across seven reasoning types—abductive, analogical, causal, deductive, inductive, spatial, and temporal—based on 19,108 real-world images without requiring domain-specific expertise.
Evaluations on 37 state-of-the-art MLLMs, including GPT-5 and Gemini-2.5-Pro, reveal significant performance gaps, with top models achieving only ~58% accuracy and showing strong disparities across reasoning types, particularly struggling in causal, spatial, and temporal reasoning.
Through in-depth analysis using MMR-Life, we uncover key insights into MLLM reasoning paradigms, such as the limited benefit of extended thinking length for most reasoning types and the clustering of reasoning behaviors, offering actionable directions for improving next-generation multimodal systems.

Introduction

The authors leverage the growing capabilities of multimodal large language models (MLLMs) to tackle complex reasoning tasks, but note that most existing benchmarks focus on either expert-level knowledge or synthetic puzzles — both poorly aligned with real-world visual reasoning, which typically involves multiple images and commonsense logic. Prior work also largely ignores multi-image inputs or restricts evaluation to narrow reasoning types, failing to capture the diversity of everyday scenarios. Their main contribution is MMR-Life, a new benchmark comprising 2,646 questions across seven reasoning types — abductive, analogical, causal, deductive, inductive, spatial, and temporal — all grounded in real-life image sets. Evaluating 37 state-of-the-art models reveals even top performers like GPT-5 struggle, achieving only 58% accuracy and showing significant weaknesses in causal, spatial, and temporal reasoning, highlighting critical gaps in current MLLM capabilities.

Dataset

The authors use MMR-Life, a novel multimodal benchmark designed to evaluate MLLMs on real-life reasoning tasks. Here’s how the dataset is structured and used:

Composition and Sources:
MMR-Life contains 2,646 multiple-choice questions based on 19,108 real-world images, covering 7 reasoning types (abductive, analogical, causal, deductive, inductive, spatial, temporal) across 21 tasks. Images are sourced from:
- Public datasets (e.g., Kaggle) with high-resolution, contextually related images
- Web screenshots (e.g., eBird for bird distribution)
- Public video frames, extracted and filtered for clarity
- Existing multi-image or video reasoning benchmarks
  All images are natural photos—no symbolic diagrams or artificial graphics are included.
Key Subset Details:
- Each question requires at least two images.
- Questions are generated via automated rules (for explicit visual cues) or manual annotation (for implicit reasoning).
- Five answer options per question: one correct, four incorrect. Incorrect options are generated via heuristic sampling (for image choices) or LLMs (GPT-5-mini, GPT-4o, Qwen2.5-VL-32B) and manually refined.
- 3.2K total QA pairs were initially created; filtered down to 2,646 via quality control.
- Filtering steps:
  1. Difficulty: Remove questions answered correctly by 3 small MLLMs (Qwen2.5-VL-7B, Gemma3-4B, InternVL3.5-8B).
  2. Format: Manual revision to align incorrect options with correct ones in length and structure.
  3. Quality: Co-authors review and remove ambiguous, multi-answer, or domain-specific questions.
Usage in Training/Experiments:
- The dataset is used for evaluation only—not for training.
- No mixture ratios or training splits are applied; all models are tested on the full benchmark.
- The paper does not mention cropping or resizing; images are used as collected, with quality and clarity prioritized during extraction.
Processing and Metadata:
- All annotations follow strict guidelines: English-only, no domain expertise required, unambiguous, and aligned with reasoning type definitions.
- Metadata includes reasoning type, source, and image count per question.
- Ethical compliance: No copyrighted, private, or harmful content; no crowdsourcing; all annotators volunteered.
- Reproducibility: Full data sources, annotation prompts, and a 210-item subset are provided in appendices and supplementary materials.

Method

The authors leverage a structured prompting framework to guide multimodal reasoning models through complex tasks involving multiple images. This framework is designed to enforce consistent output formats while encouraging step-by-step reasoning, which is critical for generating reliable negative options and validating correct answers. Each prompt template is tailored to a specific output structure, ensuring that the model’s response aligns with the expected semantic and syntactic constraints of the task.

For instance, in the case of generating negative options for reasoning tasks, the authors employ a series of prompts that progressively constrain the output format. One such prompt instructs the model to produce a sequence of image indices, formatted as “x-x-x-x…”, where each ‘x’ corresponds to a specific image in the input set. This structured output facilitates downstream evaluation and comparison against ground truth sequences.

Another variant of the prompt restricts the output to directional responses, requiring the model to select from a predefined set of eight common directions. This constraint is particularly useful in navigation or spatial reasoning tasks where directional accuracy is paramount.

In tasks requiring sequential action planning, the authors introduce a numbered action format, where each step must be prefixed with an integer and selected from a limited set of actions such as “Turn Left,” “Turn Right,” or “Go forward until the xxx.” This ensures that the generated sequences are both semantically valid and executable.

For multiple-choice question answering, the authors adopt a Chain-of-Thought (CoT) style prompt that explicitly directs the model to select from a fixed set of options (A/B/C/D/E). This format not only standardizes the output but also encourages the model to articulate its reasoning before arriving at the final selection.

Across all prompt variants, the authors consistently include the directive “Let’s think step by step before answering,” which serves as a meta-instruction to activate the model’s reasoning capabilities. This design choice reflects a deliberate effort to scaffold complex reasoning through structured output constraints and explicit reasoning prompts.

Experiment

MMR-Life benchmark reveals significant gaps between current MLLMs and human performance, especially in real-life reasoning scenarios, with even top models like GPT-5 scoring 14% below humans.
Models show strong performance in analogical and deductive reasoning but struggle notably in spatial, temporal, and causal reasoning, highlighting a bias toward pattern association over abstract world modeling.
Adding “thinking” modes improves performance in closed-source models but offers little to no benefit for open-source models, suggesting current open-source thinking frameworks lack generalization to real-world contexts.
Longer reasoning chains correlate logarithmically with higher accuracy overall, but this benefit is task-dependent—inductive reasoning often degrades with extended CoT, while analogical reasoning improves.
Standard reasoning-enhancement methods like BoN and GRPO show diminishing returns or even performance drops on larger models, indicating limited generalizability as model scale increases.
Reinforcement learning methods underperform compared to inference-time techniques like Best-of-N on small models, raising questions about RL’s effectiveness for reasoning generalization.
Reasoning types exhibit varying correlations; analogical and inductive reasoning are highly correlated, while spatial reasoning is isolated, suggesting distinct underlying cognitive patterns.
Error analysis of top models reveals dominant failures in logical reasoning (e.g., causal inversion, temporal confusion), abstraction, knowledge recall, and perception, pointing to core limitations in current MLLMs.

The authors evaluate 37 multimodal language models on the MMR-Life benchmark, revealing that even top closed-source models like GPT-5 fall significantly short of human performance, particularly in spatial and temporal reasoning. While longer reasoning chains generally correlate with better accuracy, this benefit is not universal and varies by reasoning type, with some tasks like inductive reasoning showing no improvement or even degradation with extended CoT. Open-source thinking models show little to no advantage over their non-thinking counterparts, suggesting current reasoning enhancements are not yet effective for real-world generalization.

The authors use a diverse set of 2,646 questions across seven reasoning types to evaluate multimodal language models, revealing that even top-performing models struggle with real-world reasoning tasks compared to human performance. Results show significant performance gaps in spatial and temporal reasoning, while analogical and deductive reasoning are relatively better handled, highlighting a need for models to better learn abstract world representations. Current open-source thinking models do not consistently outperform their non-thinking counterparts, suggesting limited generalization in real-world contexts despite longer reasoning processes.

The authors evaluate multiple reasoning enhancement methods across Qwen2.5-VL models of varying sizes and find that performance gains from techniques like Self-Consistency and Best-of-N diminish as model scale increases, with larger models sometimes performing worse under these methods than with basic CoT. For the 72B model, GRPO and BoN show no consistent advantage over CoT, suggesting that advanced reasoning methods may not generalize well to larger architectures. Results indicate that model scale alone does not guarantee improved reasoning, and enhancement techniques must be carefully matched to model capacity and task type.

The authors evaluate 37 multimodal language models on the MMR-Life benchmark, revealing that even top closed-source models like GPT-5 fall significantly short of human performance, particularly in spatial and temporal reasoning. While thinking modes improve closed-source models, they offer little to no benefit for open-source models, and longer reasoning chains do not consistently enhance accuracy across reasoning types. Results also show that current reasoning-enhancement methods like BoN and GRPO provide diminishing returns or even degrade performance on larger models, highlighting a need for more effective generalizable techniques.

The authors evaluate 37 multimodal language models on the MMR-Life benchmark, revealing that even top closed-source models like GPT-5 fall significantly short of human performance, particularly in spatial and temporal reasoning. While thinking modes improve closed-source models, they offer no consistent benefit for open-source models, and longer reasoning chains do not universally enhance accuracy across reasoning types. Results indicate that current models struggle with abstract, real-world reasoning despite excelling in analogical and deductive tasks, highlighting a need for training approaches that better generalize to everyday scenarios.

소스 PDF 코드 보기

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

3달 전

Jiachun Li Shaoping Huang Zhuoran Jin Chenlong Zhang Pengfei Cao Yubo Chen Kang Liu Jun Zhao

초록

One-sentence Summary

Key Contributions

We introduce MMR-Life, the first comprehensive benchmark for evaluating multimodal multi-image reasoning in real-life scenarios, featuring 2,646 questions across seven reasoning types—abductive, analogical, causal, deductive, inductive, spatial, and temporal—based on 19,108 real-world images without requiring domain-specific expertise.
Evaluations on 37 state-of-the-art MLLMs, including GPT-5 and Gemini-2.5-Pro, reveal significant performance gaps, with top models achieving only ~58% accuracy and showing strong disparities across reasoning types, particularly struggling in causal, spatial, and temporal reasoning.
Through in-depth analysis using MMR-Life, we uncover key insights into MLLM reasoning paradigms, such as the limited benefit of extended thinking length for most reasoning types and the clustering of reasoning behaviors, offering actionable directions for improving next-generation multimodal systems.

Introduction

Dataset

The authors use MMR-Life, a novel multimodal benchmark designed to evaluate MLLMs on real-life reasoning tasks. Here’s how the dataset is structured and used:

Composition and Sources:
MMR-Life contains 2,646 multiple-choice questions based on 19,108 real-world images, covering 7 reasoning types (abductive, analogical, causal, deductive, inductive, spatial, temporal) across 21 tasks. Images are sourced from:
- Public datasets (e.g., Kaggle) with high-resolution, contextually related images
- Web screenshots (e.g., eBird for bird distribution)
- Public video frames, extracted and filtered for clarity
- Existing multi-image or video reasoning benchmarks
  All images are natural photos—no symbolic diagrams or artificial graphics are included.
Key Subset Details:
- Each question requires at least two images.
- Questions are generated via automated rules (for explicit visual cues) or manual annotation (for implicit reasoning).
- Five answer options per question: one correct, four incorrect. Incorrect options are generated via heuristic sampling (for image choices) or LLMs (GPT-5-mini, GPT-4o, Qwen2.5-VL-32B) and manually refined.
- 3.2K total QA pairs were initially created; filtered down to 2,646 via quality control.
- Filtering steps:
  1. Difficulty: Remove questions answered correctly by 3 small MLLMs (Qwen2.5-VL-7B, Gemma3-4B, InternVL3.5-8B).
  2. Format: Manual revision to align incorrect options with correct ones in length and structure.
  3. Quality: Co-authors review and remove ambiguous, multi-answer, or domain-specific questions.
Usage in Training/Experiments:
- The dataset is used for evaluation only—not for training.
- No mixture ratios or training splits are applied; all models are tested on the full benchmark.
- The paper does not mention cropping or resizing; images are used as collected, with quality and clarity prioritized during extraction.
Processing and Metadata:
- All annotations follow strict guidelines: English-only, no domain expertise required, unambiguous, and aligned with reasoning type definitions.
- Metadata includes reasoning type, source, and image count per question.
- Ethical compliance: No copyrighted, private, or harmful content; no crowdsourcing; all annotators volunteered.
- Reproducibility: Full data sources, annotation prompts, and a 210-item subset are provided in appendices and supplementary materials.

Method

Experiment

MMR-Life benchmark reveals significant gaps between current MLLMs and human performance, especially in real-life reasoning scenarios, with even top models like GPT-5 scoring 14% below humans.
Models show strong performance in analogical and deductive reasoning but struggle notably in spatial, temporal, and causal reasoning, highlighting a bias toward pattern association over abstract world modeling.
Adding “thinking” modes improves performance in closed-source models but offers little to no benefit for open-source models, suggesting current open-source thinking frameworks lack generalization to real-world contexts.
Longer reasoning chains correlate logarithmically with higher accuracy overall, but this benefit is task-dependent—inductive reasoning often degrades with extended CoT, while analogical reasoning improves.
Standard reasoning-enhancement methods like BoN and GRPO show diminishing returns or even performance drops on larger models, indicating limited generalizability as model scale increases.
Reinforcement learning methods underperform compared to inference-time techniques like Best-of-N on small models, raising questions about RL’s effectiveness for reasoning generalization.
Reasoning types exhibit varying correlations; analogical and inductive reasoning are highly correlated, while spatial reasoning is isolated, suggesting distinct underlying cognitive patterns.
Error analysis of top models reveals dominant failures in logical reasoning (e.g., causal inversion, temporal confusion), abstraction, knowledge recall, and perception, pointing to core limitations in current MLLMs.

The authors evaluate 37 multimodal language models on the MMR-Life benchmark, revealing that even top closed-source models like GPT-5 fall significantly short of human performance, particularly in spatial and temporal reasoning. While thinking modes improve closed-source models, they offer little to no benefit for open-source models, and longer reasoning chains do not consistently enhance accuracy across reasoning types. Results also show that current reasoning-enhancement methods like BoN and GRPO provide diminishing returns or even degrade performance on larger models, highlighting a need for more effective generalizable techniques.

The authors evaluate 37 multimodal language models on the MMR-Life benchmark, revealing that even top closed-source models like GPT-5 fall significantly short of human performance, particularly in spatial and temporal reasoning. While thinking modes improve closed-source models, they offer no consistent benefit for open-source models, and longer reasoning chains do not universally enhance accuracy across reasoning types. Results indicate that current models struggle with abstract, real-world reasoning despite excelling in analogical and deductive tasks, highlighting a need for training approaches that better generalize to everyday scenarios.

소스 PDF 코드 보기

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

Command Palette

MMR-Life: 다중 모달 다중 이미지 추론을 위한 실제 장면의 조각 맞추기

Jiachun Li Shaoping Huang Zhuoran Jin Chenlong Zhang Pengfei Cao Yubo Chen Kang Liu Jun Zhao

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

MMR-Life: 다중 모달 다중 이미지 추론을 위한 실제 장면의 조각 맞추기

Jiachun Li Shaoping Huang Zhuoran Jin Chenlong Zhang Pengfei Cao Yubo Chen Kang Liu Jun Zhao

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

MMR-Life: 다중 모달 다중 이미지 추론을 위한 실제 장면의 조각 맞추기

Jiachun Li Shaoping Huang Zhuoran Jin Chenlong Zhang Pengfei Cao Yubo Chen Kang Liu Jun Zhao

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters