MM-CondChain: A Program-Verified Benchmark for Visually Grounded Deep Compositional Reasoning
Haozhan Shen Shilin Yan Hongwei Xue Shuaiqi Lu Xiaojun Tang Guannan Zhang Tiancheng Zhao Jianwei Yin
Abstract
Multimodal large language models (MLLMs) are increasingly used to execute visual workflows such as GUI manipulation. In these tasks, the next step depends on verified visual compositional conditions, e.g., "if the permission dialog is visible and the interface color is green, click Allow," and execution may branch or terminate early. Evaluation of this capability, however, remains insufficient: existing benchmarks focus on shallow compositions or independent constraints and do not cover deeply chained compositional conditional branching. In this work, we propose MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is structured as a multi-layer reasoning chain in which every layer contains a non-trivial compositional condition built from multiple objects, attributes, or relations and grounded in visual evidence. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and trace the resulting execution path to the final outcome. To construct such workflow-style data at scale, we propose an agentic synthesis pipeline in which a Planner coordinates layer-by-layer generation of compositional conditions, a Verifiable Programmatic Intermediate Representation (VPIR) guarantees that every layer's condition is mechanically verifiable, and a Composer then assembles the verified layers into a complete instruction. Using this pipeline, we build the benchmark over three visual domains: natural images, data charts, and GUI trajectories. Experiments on multiple MLLMs show that even the strongest model reaches a Path F1 of only 53.33, with sharp performance drops on hard negatives and further degradation as depth and predicate complexity increase. These results confirm that deep compositional reasoning remains a fundamental challenge.
One-sentence Summary
Researchers from Alibaba Group and Zhejiang University introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning that employs a VPIR-based agentic pipeline to generate mechanically verifiable, multi-layer conditional chains, revealing that even state-of-the-art multimodal models struggle with complex visual workflows requiring precise step-by-step verification.
Key Contributions
- Existing benchmarks fail to evaluate deep compositional reasoning because they focus on shallow single-layer compositions or independent constraints rather than multi-layer visual workflows where each step determines the execution path.
- The authors introduce MM-CondChain, a benchmark featuring nested conditional chains grounded in visual evidence, constructed via an agentic synthesis pipeline that uses a Verifiable Programmatic Intermediate Representation to ensure mechanical verifiability.
- Experiments across natural images, data charts, and GUI trajectories reveal that even the strongest multimodal models achieve only 53.33 Path F1, demonstrating significant performance drops as reasoning depth and predicate complexity increase.
Introduction
Multimodal Large Language Models are increasingly deployed in complex visual workflows like GUI navigation, where subsequent actions depend on verifying chained visual conditions. However, existing benchmarks fail to evaluate this capability because they focus on shallow, single-layer compositions or independent constraints rather than deep, multi-step reasoning paths that branch or terminate based on visual evidence. To address this gap, the authors introduce MM-CondChain, a benchmark featuring multi-layer control flow with mechanically verified hard negatives. They achieve scalable and reliable data construction through an agentic synthesis pipeline that uses a Verifiable Programmatic Intermediate Representation to decouple logical condition generation from natural language rendering.
Dataset
- Dataset Composition and Sources: The authors construct MM-CondChain from three distinct visual domains using publicly available datasets. The Natural domain includes 398 images from SAM and GQA, the Chart domain features 200 chart images from ChartQA, and the GUI domain comprises 377 interaction trajectories (totaling 3,421 screenshots) sourced from AITZ.
- Key Details for Each Subset:
- Natural: Focuses on object attributes and spatial relations.
- Chart: Concentrates on numerical and structural statistics across bar, line, and pie charts.
- GUI: Emphasizes action, state, and trajectory-level metadata with fine-grained reasoning annotations.
- Total Volume: The benchmark contains 975 evaluation samples, where each sample consists of a paired True-path and False-path instance.
- Data Usage and Processing:
- Synthesis Pipeline: The authors employ Gemini-3-Pro to instantiate all agents in the synthesis pipeline, including the Planner, Verifier, Fact Extractor, and Translator.
- Subject De-leakage: An MLLM-based rewriter modifies subject descriptions to remove condition-revealing attributes while ensuring the subject remains uniquely referential to the target object.
- Paired-Path Instantiation: Each control-flow skeleton generates two nearly isomorphic instances. The True-path follows all conditions to a terminal layer, while the False-path swaps a single condition at a randomly sampled divergence layer to trigger early termination.
- Instruction Compilation: The system merges subjects and conditions into fluent natural-language if-clauses to create nested instructions that serve as hard negatives.
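The nesting described in the instruction-compilation step can be sketched as a simple fold over per-layer condition texts. This is a hypothetical illustration, not the authors' compiler: the function name, the condition strings, and the exact phrasing template are all assumed.

```python
# Hypothetical sketch of instruction compilation: fold verified per-layer
# condition texts into one nested natural-language if-instruction.
def compile_instruction(conditions, final_action):
    text = final_action
    for cond in reversed(conditions):
        text = f"if {cond}, then {text}; otherwise stop"
    return text

steps = ["the permission dialog is visible", "the interface color is green"]
nested = compile_instruction(steps, "click Allow")
# -> "if the permission dialog is visible, then if the interface color is
#     green, then click Allow; otherwise stop; otherwise stop"
```

Because the clauses nest rather than concatenate, the model must evaluate each condition in order and respect early termination, which is what makes these instructions effective hard negatives.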
- Evaluation Strategy:
- Metrics: Performance is measured using True-path Accuracy, False-path Accuracy, and Path F1 (the harmonic mean of the two), with an overall score calculated as the average Path F1 across domains.
- Setup: Models are evaluated in a zero-shot setting using default API parameters, with answers extracted from multiple-choice outputs based on specific formatting rules.
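The Path F1 metric above is just the harmonic mean of the two path accuracies, averaged across domains. A minimal sketch, using made-up per-domain scores (not the paper's measured numbers):

```python
def path_f1(true_acc, false_acc):
    """Harmonic mean of True-path and False-path accuracy."""
    if true_acc + false_acc == 0:
        return 0.0
    return 2 * true_acc * false_acc / (true_acc + false_acc)

# Hypothetical (True-path acc, False-path acc) pairs per domain.
domains = {"natural": (70.0, 45.0), "chart": (75.0, 50.0), "gui": (60.0, 35.0)}
per_domain = {d: path_f1(t, f) for d, (t, f) in domains.items()}
overall = sum(per_domain.values()) / len(per_domain)  # overall benchmark score
```

The harmonic mean penalizes imbalance: a model that aces True-paths but fails False-paths scores far below the arithmetic average, which matches the benchmark's focus on hard negatives.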
Method
The authors propose a VPIR-based agentic benchmark construction pipeline that decouples logical construction from language rendering to address logical inconsistencies in multi-layer compositional reasoning. The core framework accepts multimodal inputs, including natural images, chart images with metadata, and GUI trajectories with annotations. Refer to the framework diagram for the overall architecture.
The pipeline operates through an iterative, multi-layer reasoning chain coordinated by a Planner. At each layer t, the Planner selects a relational strategy rt to determine how the reasoning chain evolves, choosing between actions such as EXTEND, FINISH, or ROLLBACK. This control mechanism ensures that the chain depth remains within a target interval while maintaining coherence.
Once a layer is initiated, the system executes a four-stage synthesis workflow. First, a Fact Extractor grounds the generation in visual evidence by selecting a subject St and producing structured facts Ft as a typed key-value mapping. This structured representation prevents hallucination and defines a programmatic namespace. Second, the VPIR Generator synthesizes a Verifiable Programmatic Intermediate Representation, consisting of a true-logic predicate pt and a counterfactual false-logic p~t. These predicates are executable Python-like code evaluated in a sandboxed environment to ensure mechanical verifiability.
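The VPIR idea can be illustrated with a toy fact namespace and a paired predicate. The fact keys and the restricted `eval` sandbox below are assumptions for illustration, not the paper's exact schema or sandbox:

```python
# Typed fact namespace F_t produced by the Fact Extractor (keys are assumed).
facts = {"dialog_visible": True, "ui_color": "green"}

p_true  = "dialog_visible and ui_color == 'green'"   # true-logic predicate p_t
p_false = "dialog_visible and ui_color == 'red'"     # counterfactual false-logic

def evaluate(predicate, namespace):
    # A restricted eval stands in for the paper's sandboxed execution:
    # the predicate can only see the fact namespace, not builtins.
    return bool(eval(predicate, {"__builtins__": {}}, dict(namespace)))

assert evaluate(p_true, facts) is True
assert evaluate(p_false, facts) is False
```

Because each predicate executes against structured facts, the truth value of every layer is decided by code, not by a language model's judgment, which is what makes the benchmark mechanically verifiable.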
Third, a Translator renders the verified executable logic into natural language condition texts ct and c~t. This step ensures that truth values are anchored in code execution rather than linguistic generation. Finally, a Composer compiles the verified chain into paired benchmark instances. It constructs a True-path where all conditions hold and a False-path where a single condition is replaced by a minimally perturbed counterfactual, creating hard negatives that require precise visual grounding and deep compositional reasoning.
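The Composer's paired-path instantiation can be sketched as a clone-and-swap over the verified chain. Field names such as `condition` and `counterfactual` are assumed for illustration, not the paper's schema:

```python
import copy
import random

# Hypothetical sketch: the False-path clones the True-path chain and swaps a
# single condition at a randomly sampled divergence layer, so the workflow
# terminates early on that path.
def make_pair(chain, seed=0):
    false_path = copy.deepcopy(chain)
    k = random.Random(seed).randrange(len(false_path))  # divergence layer
    false_path[k]["condition"] = false_path[k]["counterfactual"]
    false_path[k]["holds"] = False
    return chain, false_path, k

chain = [
    {"condition": "the dialog is visible",
     "counterfactual": "the dialog is hidden", "holds": True},
    {"condition": "the button is green",
     "counterfactual": "the button is red", "holds": True},
]
true_path, false_path, k = make_pair(chain)
```

Since the two instances differ in exactly one minimally perturbed condition, any gap between True-path and False-path accuracy isolates the model's ability to verify that single condition.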
Experiment
- Main evaluation on MM-CondChain reveals that current multimodal large language models struggle with visually grounded deep compositional reasoning, with even top performers achieving only slightly above 50% average Path F1.
- Experiments comparing true versus false paths reveal a significant bias: models tend to assume conditions hold, yielding high accuracy on valid paths but poor performance on invalid ones, which poses risks in real-world workflows.
- Domain analysis indicates that GUI tasks are the most challenging due to the need for multi-frame trajectory and state transition reasoning, whereas chart tasks are comparatively easier as they often reduce to deterministic numerical comparisons.
- Ablation studies on chain depth demonstrate that performance degrades consistently as the number of sequential verification layers increases, confirming that errors compound across layers rather than remaining isolated.
- Tests on predicate complexity reveal that increasing the logical operators and nesting within a single condition causes substantial performance drops, highlighting that models struggle with both sequential and intra-layer compositional reasoning.
- Overall, the findings establish that chain depth and predicate complexity are two orthogonal axes of difficulty that jointly define the limits of current models, making the benchmark a valuable diagnostic tool for identifying specific reasoning failures.
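The depth-compounding finding is consistent with a simple independence argument: if each verification layer succeeds with probability p, whole-chain accuracy decays geometrically with depth. A back-of-envelope illustration with an assumed per-layer accuracy (not a number from the paper):

```python
# Assumed per-layer verification accuracy; chain accuracy = p ** depth under
# an independence assumption, illustrating why errors compound across layers.
p = 0.9
chain_acc = {depth: p ** depth for depth in (1, 3, 5, 7)}
# depth 1 -> 0.9, depth 3 -> 0.729, depth 5 -> ~0.590, depth 7 -> ~0.478
```

Even a 90%-reliable single-layer verifier falls below 50% whole-chain accuracy by depth 7, which is why chain depth acts as an axis of difficulty independent of per-layer predicate complexity.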