MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning

Jiachun Li Shaoping Huang Zhuoran Jin Chenlong Zhang Pengfei Cao Yubo Chen Kang Liu Jun Zhao

Abstract

Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has enabled them to tackle more complex tasks, such as scientific analysis and mathematical reasoning. Despite this promise, the reasoning abilities of MLLMs in real-life scenarios remain largely unexplored, and standardized benchmarks for evaluating them are lacking. To address this gap, we introduce MMR-Life, a comprehensive benchmark designed to evaluate diverse multimodal multi-image reasoning capabilities in real-life scenarios. MMR-Life consists of 2,646 multiple-choice questions based on 19,108 images, primarily sourced from real-world contexts, and comprehensively covers seven reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal. Unlike existing reasoning benchmarks, MMR-Life does not rely on domain-specific expertise; instead, it requires models to integrate information from multiple images and apply diverse forms of reasoning. The evaluation of 37 advanced models highlights the substantial challenge posed by MMR-Life: even the best models, such as GPT-5, reach only 58% accuracy and show notable variation across reasoning types. In addition, we analyze the reasoning paradigms of existing MLLMs, exploring how factors such as thinking length, reasoning method, and reasoning type affect their performance. In summary, MMR-Life establishes a solid and comprehensive foundation for evaluating, analyzing, and improving the next generation of multimodal reasoning systems.

One-sentence Summary

Researchers from UCAS and CAS introduce MMR-Life, a real-world multimodal benchmark with 2,646 questions across seven reasoning types, challenging 37 models including GPT-5; it reveals critical gaps in multi-image reasoning and guides future MLLM development beyond domain-specific tasks.

Key Contributions

  • We introduce MMR-Life, the first comprehensive benchmark for evaluating multimodal multi-image reasoning in real-life scenarios, featuring 2,646 questions across seven reasoning types—abductive, analogical, causal, deductive, inductive, spatial, and temporal—based on 19,108 real-world images without requiring domain-specific expertise.
  • Evaluations on 37 state-of-the-art MLLMs, including GPT-5 and Gemini-2.5-Pro, reveal significant performance gaps, with top models achieving only ~58% accuracy and showing strong disparities across reasoning types, particularly struggling in causal, spatial, and temporal reasoning.
  • Through in-depth analysis using MMR-Life, we uncover key insights into MLLM reasoning paradigms, such as the limited benefit of extended thinking length for most reasoning types and the clustering of reasoning behaviors, offering actionable directions for improving next-generation multimodal systems.

Introduction

Multimodal large language models (MLLMs) are increasingly capable of complex reasoning, but the authors note that most existing benchmarks focus on either expert-level knowledge or synthetic puzzles. Both are poorly aligned with real-world visual reasoning, which typically involves multiple images and commonsense logic. Prior work also largely ignores multi-image inputs or restricts evaluation to narrow reasoning types, failing to capture the diversity of everyday scenarios. The main contribution is MMR-Life, a new benchmark of 2,646 questions across seven reasoning types (abductive, analogical, causal, deductive, inductive, spatial, and temporal), all grounded in real-life image sets. An evaluation of 37 state-of-the-art models shows that even top performers such as GPT-5 struggle, reaching only 58% accuracy with marked weaknesses in causal, spatial, and temporal reasoning.

Dataset

The authors use MMR-Life, a novel multimodal benchmark designed to evaluate MLLMs on real-life reasoning tasks. Here’s how the dataset is structured and used:

  • Composition and Sources:
    MMR-Life contains 2,646 multiple-choice questions based on 19,108 real-world images, covering 7 reasoning types (abductive, analogical, causal, deductive, inductive, spatial, temporal) across 21 tasks. Images are sourced from:

    • Public datasets (e.g., Kaggle) with high-resolution, contextually related images
    • Web screenshots (e.g., eBird for bird distribution)
    • Public video frames, extracted and filtered for clarity
    • Existing multi-image or video reasoning benchmarks
    All images are natural photos; no symbolic diagrams or artificial graphics are included.
  • Key Subset Details:

    • Each question requires at least two images.
    • Questions are generated via automated rules (for explicit visual cues) or manual annotation (for implicit reasoning).
    • Five answer options per question: one correct, four incorrect. Incorrect options are generated via heuristic sampling (for image choices) or LLMs (GPT-5-mini, GPT-4o, Qwen2.5-VL-32B) and manually refined.
    • About 3.2K QA pairs were created initially, then filtered down to 2,646 through quality control.
    • Filtering steps:
      1. Difficulty: Remove questions answered correctly by 3 small MLLMs (Qwen2.5-VL-7B, Gemma3-4B, InternVL3.5-8B).
      2. Format: Manual revision to align incorrect options with correct ones in length and structure.
      3. Quality: Co-authors review and remove ambiguous, multi-answer, or domain-specific questions.
  • Usage in Training/Experiments:

    • The dataset is used for evaluation only—not for training.
    • No mixture ratios or training splits are applied; all models are tested on the full benchmark.
    • The paper does not mention cropping or resizing; images are used as collected, with quality and clarity prioritized during extraction.
  • Processing and Metadata:

    • All annotations follow strict guidelines: English-only, no domain expertise required, unambiguous, and aligned with reasoning type definitions.
    • Metadata includes reasoning type, source, and image count per question.
    • Ethical compliance: No copyrighted, private, or harmful content; no crowdsourcing; all annotators volunteered.
    • Reproducibility: Full data sources, annotation prompts, and a 210-item subset are provided in appendices and supplementary materials.
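
The question format and the difficulty filter described in the list above can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the `Question` class, its field names, and `passes_difficulty_filter` are hypothetical, and the sketch assumes a question is dropped only when all three small models answer it correctly, which the summary does not state explicitly.

```python
from dataclasses import dataclass
from typing import Dict, List

REASONING_TYPES = {
    "abductive", "analogical", "causal", "deductive",
    "inductive", "spatial", "temporal",
}


@dataclass
class Question:
    """Hypothetical record layout; the field names are assumptions."""
    question_id: str
    reasoning_type: str
    source: str
    image_paths: List[str]
    options: Dict[str, str]  # option label ("A".."E") -> option text
    answer: str              # label of the single correct option

    def __post_init__(self) -> None:
        # Enforce the constraints stated in the dataset description.
        if self.reasoning_type not in REASONING_TYPES:
            raise ValueError(f"unknown reasoning type: {self.reasoning_type}")
        if len(self.image_paths) < 2:
            raise ValueError("each question requires at least two images")
        if sorted(self.options) != list("ABCDE"):
            raise ValueError("expected exactly five options labeled A-E")
        if self.answer not in self.options:
            raise ValueError("answer must be one of the option labels")


def passes_difficulty_filter(question: Question, small_model_answers: List[str]) -> bool:
    """Keep a question unless every small model answered it correctly (assumed rule)."""
    return not all(pred == question.answer for pred in small_model_answers)
```

Validating records at construction time keeps malformed items (a single image, four options) out of the pool before any model is run on them.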

Method

The authors leverage a structured prompting framework to guide multimodal reasoning models through complex tasks involving multiple images. This framework is designed to enforce consistent output formats while encouraging step-by-step reasoning, which is critical for generating reliable negative options and validating correct answers. Each prompt template is tailored to a specific output structure, ensuring that the model’s response aligns with the expected semantic and syntactic constraints of the task.

For instance, in the case of generating negative options for reasoning tasks, the authors employ a series of prompts that progressively constrain the output format. One such prompt instructs the model to produce a sequence of image indices, formatted as “x-x-x-x…”, where each ‘x’ corresponds to a specific image in the input set. This structured output facilitates downstream evaluation and comparison against ground truth sequences.
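
A structured output such as "x-x-x-x" is easy to check mechanically. The following is a minimal sketch of such a check, not the authors' code: `parse_image_sequence` is a hypothetical helper, and it assumes image indices are 1-based and that the final sequence in the response is the intended answer.

```python
import re
from typing import List, Optional

# Matches runs of integers joined by hyphens, e.g. "3-1-2-4".
SEQUENCE_PATTERN = re.compile(r"\b\d+(?:-\d+)+\b")


def parse_image_sequence(response: str, num_images: int) -> Optional[List[int]]:
    """Return the last 'x-x-x' index sequence in the response, or None if invalid."""
    matches = SEQUENCE_PATTERN.findall(response)
    if not matches:
        return None
    indices = [int(part) for part in matches[-1].split("-")]
    # Assumption: indices are 1-based and must refer to an input image.
    if any(i < 1 or i > num_images for i in indices):
        return None
    return indices
```

Taking the last match rather than the first lets the model reason freely before committing to a final sequence.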

Another variant of the prompt restricts the output to directional responses, requiring the model to select from a predefined set of eight common directions. This constraint is particularly useful in navigation or spatial reasoning tasks where directional accuracy is paramount.

In tasks requiring sequential action planning, the authors introduce a numbered action format, where each step must be prefixed with an integer and selected from a limited set of actions such as “Turn Left,” “Turn Right,” or “Go forward until the xxx.” This ensures that the generated sequences are both semantically valid and executable.
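
The numbered action format can be validated with a small parser. The sketch below is illustrative only: `parse_action_plan` and `is_valid_plan` are hypothetical names, the allowed actions are limited to the three quoted above (the actual set may be larger), and the "1." or "1)" separator after the step number is an assumption.

```python
import re
from typing import List, Tuple

# Assumption: only the three actions quoted in the text are allowed here.
ACTION_PATTERN = re.compile(
    r"^(\d+)[.)]?\s*(Turn Left|Turn Right|Go forward until the .+?)\.?$"
)


def parse_action_plan(response: str) -> List[Tuple[int, str]]:
    """Extract (step number, action) pairs from lines that match the format."""
    steps = []
    for line in response.splitlines():
        match = ACTION_PATTERN.match(line.strip())
        if match:
            steps.append((int(match.group(1)), match.group(2)))
    return steps


def is_valid_plan(steps: List[Tuple[int, str]]) -> bool:
    """A plan is valid if it is non-empty and numbered 1, 2, 3, ... in order."""
    return bool(steps) and [n for n, _ in steps] == list(range(1, len(steps) + 1))
```

Lines that do not match the pattern, such as free-form reasoning text, are simply ignored, so only well-formed steps count toward the plan.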

For multiple-choice question answering, the authors adopt a Chain-of-Thought (CoT) style prompt that explicitly directs the model to select from a fixed set of options (A/B/C/D/E). This format not only standardizes the output but also encourages the model to articulate its reasoning before arriving at the final selection.
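
Because the CoT response contains free-form reasoning before the final choice, scoring requires extracting the selected letter. The sketch below shows one way this might be done; `extract_choice` and `accuracy` are hypothetical helpers, and the assumed phrasing ("the answer is C", "Answer: (B)") is not taken from the paper.

```python
import re
from typing import List, Optional

# Assumption: the final choice is phrased like "answer is C" or "Answer: (B)".
ANSWER_PATTERN = re.compile(r"[Aa]nswer\s*(?:is|:)?\s*\(?([A-E])\)?\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the last option letter stated as the answer, or None if absent."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def accuracy(responses: List[str], gold: List[str]) -> float:
    """Fraction of responses whose extracted choice equals the gold label."""
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)
```

A response with no recognizable choice is counted as incorrect here; a stricter evaluation might instead re-prompt the model or report such cases separately.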

Across all prompt variants, the authors consistently include the directive “Let’s think step by step before answering,” which serves as a meta-instruction to activate the model’s reasoning capabilities. This design choice reflects a deliberate effort to scaffold complex reasoning through structured output constraints and explicit reasoning prompts.
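
The shared structure of these prompt variants, a task-specific format instruction followed by the common reasoning directive, can be sketched as a small template builder. The wording of the format instructions, the variant names, and `build_prompt` itself are hypothetical; only the step-by-step directive is quoted from the description above.

```python
from typing import Dict, Optional

REASONING_DIRECTIVE = "Let's think step by step before answering."

# Hypothetical format instructions paraphrasing the variants described above.
FORMAT_INSTRUCTIONS = {
    "image_sequence": 'Give the final answer as image indices in the form "x-x-x-x".',
    "action_plan": "Give the final answer as numbered actions, one per line.",
    "multiple_choice": "Give the final answer as one of A, B, C, D, or E.",
}


def build_prompt(question: str, variant: str,
                 options: Optional[Dict[str, str]] = None) -> str:
    """Assemble question, optional answer options, format rule, and directive."""
    if variant not in FORMAT_INSTRUCTIONS:
        raise ValueError(f"unknown prompt variant: {variant}")
    lines = [question]
    if options:
        lines += [f"{label}. {text}" for label, text in sorted(options.items())]
    lines.append(FORMAT_INSTRUCTIONS[variant])
    lines.append(REASONING_DIRECTIVE)
    return "\n".join(lines)
```

Keeping the directive in a single constant ensures every variant ends with the same instruction, so differences between variants come only from the format rule.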

Experiment

  • MMR-Life benchmark reveals significant gaps between current MLLMs and human performance, especially in real-life reasoning scenarios, with even top models like GPT-5 scoring 14% below humans.
  • Models show strong performance in analogical and deductive reasoning but struggle notably in spatial, temporal, and causal reasoning, highlighting a bias toward pattern association over abstract world modeling.
  • Adding “thinking” modes improves performance in closed-source models but offers little to no benefit for open-source models, suggesting current open-source thinking frameworks lack generalization to real-world contexts.
  • Longer reasoning chains correlate logarithmically with higher accuracy overall, but this benefit is task-dependent—inductive reasoning often degrades with extended CoT, while analogical reasoning improves.
  • Standard reasoning-enhancement methods like BoN and GRPO show diminishing returns or even performance drops on larger models, indicating limited generalizability as model scale increases.
  • Reinforcement learning methods underperform compared to inference-time techniques like Best-of-N on small models, raising questions about RL’s effectiveness for reasoning generalization.
  • Reasoning types exhibit varying correlations; analogical and inductive reasoning are highly correlated, while spatial reasoning is isolated, suggesting distinct underlying cognitive patterns.
  • Error analysis of top models reveals dominant failures in logical reasoning (e.g., causal inversion, temporal confusion), abstraction, knowledge recall, and perception, pointing to core limitations in current MLLMs.
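
Comparisons across reasoning types, as in the findings above, rest on per-type accuracy. A minimal sketch of that aggregation is shown below; `accuracy_by_type` and the `(reasoning_type, is_correct)` record layout are assumptions for illustration, not the authors' evaluation code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_type(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Group (reasoning_type, is_correct) records and compute accuracy per type."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for reasoning_type, is_correct in records:
        totals[reasoning_type] += 1
        correct[reasoning_type] += int(is_correct)
    return {t: correct[t] / totals[t] for t in totals}
```

Reporting accuracy per type rather than a single overall score is what makes gaps such as those in spatial or temporal reasoning visible.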

The authors evaluate 37 multimodal language models on the 2,646 MMR-Life questions spanning seven reasoning types. Even top closed-source models such as GPT-5 fall significantly short of human performance, particularly in spatial and temporal reasoning, while analogical and deductive reasoning are handled relatively well. This pattern points to a need for models that learn abstract world representations and for training approaches that generalize better to everyday scenarios.

Thinking modes improve closed-source models, but open-source thinking models show little to no consistent advantage over their non-thinking counterparts, which suggests limited generalization to real-world contexts despite longer reasoning processes. Longer reasoning chains generally correlate with better accuracy, but the benefit varies by reasoning type: some tasks, such as inductive reasoning, show no improvement or even degrade with extended CoT.

The authors also evaluate multiple reasoning-enhancement methods across Qwen2.5-VL models of varying sizes. Gains from techniques such as Self-Consistency and Best-of-N (BoN) diminish as model scale increases, and larger models sometimes perform worse under these methods than with basic CoT. For the 72B model, GRPO and BoN show no consistent advantage over CoT, suggesting that these methods may not generalize well to larger models. The results indicate that model scale alone does not guarantee improved reasoning, and that enhancement techniques must be carefully matched to model capacity and task type.
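
For readers unfamiliar with the inference-time methods discussed in this section, Self-Consistency and Best-of-N can be sketched in a few lines. This is a generic illustration of the two techniques, not the authors' setup: the function names are hypothetical, and the `score` callable stands in for whatever reward model or verifier ranks candidates, which the summary does not specify.

```python
from collections import Counter
from typing import Callable, Sequence


def self_consistency(answers: Sequence[str]) -> str:
    """Majority vote over final answers from several sampled reasoning chains."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # On a tie, the answer that was sampled first among the tied ones wins.
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate response that the scoring function ranks highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)
```

Both methods spend extra inference compute on multiple samples; they differ in whether the final answer is chosen by agreement among samples or by an external scorer.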

