MM-CondChain: An Algorithmically Verified Benchmark for Visually Grounded Deep Compositional Reasoning
Haozhan Shen, Shilin Yan, Hongwei Xue, Shuaiqi Lu, Xiaojun Tang, Guannan Zhang, Tiancheng Zhao, Jianwei Yin
Abstract
Multimodal large language models (MLLMs) are increasingly used to execute visual workflows, such as GUI navigation, where the next step depends on verified compositional visual conditions (e.g., "if a permission dialog appears and the interface color is green, click Allow"), and where the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow compositions or independent constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark dedicated to visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain in which every layer contains a non-trivial compositional condition, grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To construct workflow-style data at scale, we propose an agentic synthesis pipeline: a planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) guarantees that each layer's condition is mechanically verifiable. A composer then assembles the verified layers into complete instructions. With this pipeline, we build benchmarks covering three visual domains: natural images, data charts, and GUI trajectories.
Experiments across a range of MLLMs show that even the strongest model reaches only 53.33 Path F1, with sharp drops on hard negatives and steady degradation as depth or predicate complexity increases. These results confirm that deep compositional reasoning remains a fundamental challenge.
One-sentence Summary
Researchers from Alibaba Group and Zhejiang University introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning that employs a VPIR-based agentic pipeline to generate mechanically verifiable, multi-layer conditional chains, revealing that even state-of-the-art multimodal models struggle with complex visual workflows requiring precise step-by-step verification.
Key Contributions
- Existing benchmarks fail to evaluate deep compositional reasoning because they focus on shallow single-layer compositions or independent constraints rather than multi-layer visual workflows where each step determines the execution path.
- The authors introduce MM-CondChain, a benchmark featuring nested conditional chains grounded in visual evidence, constructed via an agentic synthesis pipeline that uses a Verifiable Programmatic Intermediate Representation to ensure mechanical verifiability.
- Experiments across natural images, data charts, and GUI trajectories reveal that even the strongest multimodal models achieve only 53.33 Path F1, demonstrating significant performance drops as reasoning depth and predicate complexity increase.
Introduction
Multimodal Large Language Models are increasingly deployed in complex visual workflows like GUI navigation, where subsequent actions depend on verifying chained visual conditions. However, existing benchmarks fail to evaluate this capability because they focus on shallow, single-layer compositions or independent constraints rather than deep, multi-step reasoning paths that branch or terminate based on visual evidence. To address this gap, the authors introduce MM-CondChain, a benchmark featuring multi-layer control flow with mechanically verified hard negatives. They achieve scalable and reliable data construction through an agentic synthesis pipeline that uses a Verifiable Programmatic Intermediate Representation to decouple logical condition generation from natural language rendering.
Dataset
- Dataset Composition and Sources: The authors construct MM-CondChain from three distinct visual domains using publicly available datasets. The Natural domain includes 398 images from SAM and GQA, the Chart domain features 200 chart images from ChartQA, and the GUI domain comprises 377 interaction trajectories (totaling 3,421 screenshots) sourced from AITZ.
- Key Details for Each Subset:
- Natural: Focuses on object attributes and spatial relations.
- Chart: Concentrates on numerical and structural statistics across bar, line, and pie charts.
- GUI: Emphasizes action, state, and trajectory-level metadata with fine-grained reasoning annotations.
- Total Volume: The benchmark contains 975 evaluation samples, where each sample consists of a paired True-path and False-path instance.
- Data Usage and Processing:
- Synthesis Pipeline: The authors employ Gemini-3-Pro to instantiate all agents in the synthesis pipeline, including the Planner, Verifier, Fact Extractor, and Translator.
- Subject De-leakage: An MLLM-based rewriter modifies subject descriptions to remove condition-revealing attributes while ensuring the subject remains uniquely referential to the target object.
- Paired-Path Instantiation: Each control-flow skeleton generates two nearly isomorphic instances. The True-path follows all conditions to a terminal layer, while the False-path swaps a single condition at a randomly sampled divergence layer to trigger early termination.
- Instruction Compilation: The system merges subjects and conditions into fluent natural-language if-clauses to create nested instructions that serve as hard negatives.
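The paired-path instantiation described above can be sketched in a few lines. This is a minimal illustration under an assumed layer format (`true_condition` / `false_condition` keys and the `instantiate_paths` helper are hypothetical names, not the paper's actual code):

```python
import random

def instantiate_paths(layers, divergence_layer=None):
    """Build a paired True-path / False-path instance from a verified
    control-flow skeleton. Each layer holds a verified condition and a
    minimally perturbed counterfactual (field names are illustrative)."""
    if divergence_layer is None:
        divergence_layer = random.randrange(len(layers))
    # The True-path follows every verified condition to the terminal layer.
    true_path = [layer["true_condition"] for layer in layers]
    # The False-path is nearly isomorphic: only the condition at the sampled
    # divergence layer is swapped, triggering early termination there.
    false_path = (true_path[:divergence_layer]
                  + [layers[divergence_layer]["false_condition"]])
    return true_path, false_path
```

Keeping the two paths identical up to the divergence layer is what makes the False-path a hard negative: a model cannot distinguish them without actually verifying the swapped condition.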
- Evaluation Strategy:
- Metrics: Performance is measured using True-path Accuracy, False-path Accuracy, and Path F1 (the harmonic mean of the two), with an overall score calculated as the average Path F1 across domains.
- Setup: Models are evaluated in a zero-shot setting using default API parameters, with answers extracted from multiple-choice outputs based on specific formatting rules.
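The scoring above reduces to a few lines; a sketch under the stated definitions (function names are illustrative):

```python
def path_f1(true_path_acc, false_path_acc):
    """Path F1: harmonic mean of True-path and False-path accuracy."""
    if true_path_acc + false_path_acc == 0:
        return 0.0
    return 2 * true_path_acc * false_path_acc / (true_path_acc + false_path_acc)

def overall_score(domain_path_f1s):
    """Overall score: average Path F1 across the evaluated domains."""
    return sum(domain_path_f1s) / len(domain_path_f1s)
```

Because the harmonic mean collapses toward the weaker side, a model that always answers "all conditions hold" gets high True-path accuracy but near-zero False-path accuracy, and therefore a near-zero Path F1.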
Method
The authors propose a VPIR-based agentic benchmark construction pipeline that decouples logical construction from language rendering to address logical inconsistencies in multi-layer compositional reasoning. The core framework accepts multimodal inputs, including natural images, chart images with metadata, and GUI trajectories with annotations. Refer to the framework diagram for the overall architecture.
The pipeline operates through an iterative, multi-layer reasoning chain coordinated by a Planner. At each layer t, the Planner selects a relational strategy r_t to determine how the reasoning chain evolves, choosing among actions such as EXTEND, FINISH, or ROLLBACK. This control mechanism keeps the chain depth within a target interval while maintaining coherence.
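The Planner's control loop can be sketched as follows. This is a toy approximation under assumed semantics (the `propose_layer` / `verify_layer` callables stand in for the paper's agents, and the exact rollback policy is a guess):

```python
def plan_chain(min_depth, max_depth, propose_layer, verify_layer, max_tries=20):
    """Illustrative Planner loop: EXTEND the chain with a verified layer,
    ROLLBACK when coherence fails, FINISH once depth lands in the target
    interval [min_depth, max_depth]."""
    chain, tries = [], 0
    while len(chain) < max_depth and tries < max_tries:
        tries += 1
        candidate = propose_layer(chain)
        if verify_layer(candidate):
            chain.append(candidate)   # EXTEND with a verified layer
        elif chain:
            chain.pop()               # ROLLBACK: drop the incoherent tail
        if len(chain) >= min_depth:
            break                     # FINISH inside the target interval
    return chain
```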
Once a layer is initiated, the system executes a four-stage synthesis workflow. First, a Fact Extractor grounds the generation in visual evidence by selecting a subject S_t and producing structured facts F_t as a typed key-value mapping. This structured representation prevents hallucination and defines a programmatic namespace. Second, the VPIR Generator synthesizes a Verifiable Programmatic Intermediate Representation, consisting of a true-logic predicate p_t and a counterfactual false-logic p̃_t. These predicates are executable Python-like code evaluated in a sandboxed environment to ensure mechanical verifiability.
(Figure: the four-stage synthesis workflow.)
Third, a Translator renders the verified executable logic into natural-language condition texts c_t and c̃_t. This step ensures that truth values are anchored in code execution rather than linguistic generation. Finally, a Composer compiles the verified chain into paired benchmark instances. It constructs a True-path where all conditions hold and a False-path where a single condition is replaced by a minimally perturbed counterfactual, creating hard negatives that require precise visual grounding and deep compositional reasoning.
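The VPIR verification step can be illustrated with a minimal sketch. The fact mapping, predicate strings, and `check` helper below are assumptions for illustration, not the paper's exact representation; the point is that each condition's truth value comes from executing code over extracted facts, not from language generation:

```python
# Hypothetical facts a Fact Extractor might emit for one layer
# (a typed key-value namespace grounded in the image):
facts = {"dialog_visible": True, "dialog_color": "green", "button_count": 2}

# VPIR pair: a true-logic predicate and its minimally perturbed
# counterfactual, as Python expressions over the fact namespace.
p_true  = "dialog_visible and dialog_color == 'green'"
p_false = "dialog_visible and dialog_color == 'red'"

def check(predicate, facts):
    """Mechanically verify a predicate in a restricted namespace:
    no builtins, only the extracted facts are visible."""
    return bool(eval(predicate, {"__builtins__": {}}, dict(facts)))

assert check(p_true, facts) is True    # condition holds on this image
assert check(p_false, facts) is False  # counterfactual is falsified
```

Note that stripping `__builtins__` is only a lightweight restriction; a production sandbox would need stronger isolation than `eval` alone provides.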
Experiment
- Main evaluation on MM-CondChain reveals that current multimodal large language models struggle with visually grounded deep compositional reasoning, with even top performers achieving only slightly above 50% average Path F1.
- Experiments comparing true versus false paths show a significant bias where models over-assume conditions hold, leading to high accuracy on valid paths but poor performance on invalid ones, which poses risks in real-world workflows.
- Domain analysis indicates that GUI tasks are the most challenging due to the need for multi-frame trajectory and state transition reasoning, whereas chart tasks are comparatively easier as they often reduce to deterministic numerical comparisons.
- Ablation studies on chain depth demonstrate that performance degrades consistently as the number of sequential verification layers increases, confirming that errors compound across layers rather than remaining isolated.
- Tests on predicate complexity reveal that increasing the logical operators and nesting within a single condition causes substantial performance drops, highlighting that models struggle with both sequential and intra-layer compositional reasoning.
- Overall, the findings establish that chain depth and predicate complexity are two orthogonal axes of difficulty that jointly define the limits of current models, making the benchmark a valuable diagnostic tool for identifying specific reasoning failures.
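The depth ablation is consistent with a simple compounding-error model: if per-layer verification errors were independent (an assumption, not a claim from the paper), whole-chain accuracy would decay geometrically with depth:

```python
def expected_chain_accuracy(per_layer_acc, depth):
    """Under an independence assumption, the probability that all
    `depth` sequential verifications succeed decays geometrically."""
    return per_layer_acc ** depth

# e.g. 90% per-layer accuracy already falls below 60% at depth 5
for d in (1, 3, 5):
    print(d, round(expected_chain_accuracy(0.9, d), 3))
```

This toy model captures why errors compound across layers, though the observed curves need not match it exactly, since per-layer difficulty and error correlations vary in practice.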