
Strategic Navigation or Stochastic Exploration? How Agents and Humans Reason over Document Collections

Abstract

Multimodal agents offer a promising path toward automating complex, document-intensive workflows. A critical question remains, however: do these agents exhibit genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we present MADQA, a benchmark of 2,250 human-authored questions grounded in 800 diverse PDF documents. Guided by Classical Test Theory, the benchmark is designed to maximize discriminative power across different levels of agentic ability. To characterize agent behavior, we introduce a novel evaluation protocol that measures the accuracy-effort trade-off. Analysis under this framework shows that while the best agents can match human searchers in raw accuracy, the questions they succeed on differ markedly from those humans solve, and they compensate for weak strategic planning with brute-force search. These agents also fail to close a roughly 20% gap to oracle performance, repeatedly falling into unproductive loops. We release the dataset and evaluation harness to encourage a shift from brute-force search toward calibrated, efficient reasoning.

One-sentence Summary

Researchers from Snowflake and collaborating institutions introduce MADQA, a benchmark of 2,250 questions across 800 PDFs, to reveal that current multimodal agents rely on brute-force search rather than strategic reasoning, highlighting a critical gap in efficient document-intensive workflow automation.

Key Contributions

  • The paper addresses the uncertainty of whether multimodal agents possess genuine strategic reasoning or rely on stochastic trial-and-error by formally defining Agentic Document Collection VQA with six core properties.
  • It introduces MADQA, a benchmark of 2,250 human-authored questions across 800 heterogeneous PDFs designed using Classical Test Theory to maximize discriminative power across varying agentic abilities.
  • A novel evaluation protocol measuring the accuracy-effort trade-off reveals that top agents match human raw accuracy but fail to close a 20% gap to oracle performance due to reliance on brute-force search and unproductive loops.

Introduction

Multimodal agents are increasingly deployed to automate complex, document-intensive workflows, yet it remains unclear whether they employ genuine strategic reasoning or rely on stochastic trial-and-error search. Prior benchmarks often focus on single documents, use semi-automated annotations, or evaluate agents on web pages rather than heterogeneous PDF collections, failing to capture the iterative planning required for real-world tasks. To address these gaps, the authors introduce MADQA, a rigorously validated benchmark of 2,250 human-authored questions across 800 diverse PDFs, alongside a novel evaluation protocol that measures the accuracy-effort trade-off. Their analysis reveals that while top agents match human accuracy, they achieve this through brute-force search and unproductive loops rather than calibrated strategic planning, highlighting a critical need to shift from retrieval-heavy approaches to efficient reasoning.

Dataset

MADQA Dataset Overview

The authors introduce the Multimodal Agentic Document QA (MADQA) benchmark to evaluate multimodal large language models on complex, multi-stage information retrieval and reasoning tasks within enterprise settings.

  • Dataset Composition and Sources

    • The corpus consists of 800 manually curated PDFs sourced from DocumentCloud, covering 13 high-level domains and 63 fine-grained categories.
    • Documents include diverse real-world materials such as financial reports, legal filings, government forms, and technical manuals, ranging from single-page summaries to 800+ page filings.
    • The collection emphasizes layout heterogeneity, featuring high table density in financial documents, figure-heavy technical reports, and text-dense legal records.
  • Key Details for Each Subset

    • Total Size: The dataset contains 2,250 human-authored question-answer pairs grounded strictly in the provided documents.
    • Reasoning Types: Approximately 17.3% of questions require multi-hop reasoning, with 8.3% needing cross-page synthesis within a single document and 9.0% requiring cross-document aggregation.
    • Evidence Granularity: Annotations specify a minimal evidence set at the page level rather than bounding boxes, aligning with standard retrieval system operations.
    • Quality Control: A rigorous pipeline involving over 1,200 hours of professional work ensured solvability and lack of ambiguity, with automated checks using GPT-5 and manual review by domain experts.
  • Data Usage and Splits

    • Training Set: Contains 1,550 samples with released annotations to facilitate reinforcement learning-based optimization.
    • Development Set: Includes 200 samples released with ground truth for model tuning.
    • Test Set: Comprises 500 samples with hidden labels for leaderboard evaluation.
    • Split Strategy: The authors apply Classical Test Theory to select items based on difficulty and discrimination power. The test set includes a "Sentinel Pool" of 100 items that current models cannot solve to ensure the benchmark retains headroom for future model improvements.
  • Processing and Construction Details

    • Document Clustering: The authors intentionally curated clusters of up to 30 related documents (e.g., sequential reports) to enable realistic cross-document multi-hop questions.
    • Annotation Protocol: Annotators were restricted from using external world knowledge and instructed to create questions that are unanswerable without the provided corpus.
    • Human Baseline: A custom web interface with a BM25 search engine was used to collect human baselines, logging search trajectories and navigation actions to compare human and agent retrieval strategies.
    • Bias Considerations: The dataset is English-only and primarily sourced from the United States, focusing on public records that may contain personally identifiable information.
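The Classical Test Theory item selection described in the split strategy can be sketched with the two standard item statistics: difficulty as the proportion of examinees answering an item correctly, and discrimination as the point-biserial correlation between an item's score and the rest-of-test total. This is a minimal stdlib sketch under those textbook definitions; the paper's exact selection thresholds are not reproduced here.

```python
from statistics import mean

def _pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def item_stats(responses):
    """responses: one 0/1 score list per examinee, one entry per item.
    Returns (difficulty, discrimination) for each item."""
    n_items = len(responses[0])
    totals = [sum(r) for r in responses]
    stats = []
    for i in range(n_items):
        scores = [r[i] for r in responses]
        p = mean(scores)  # difficulty: proportion answering correctly
        # discrimination: correlation of item score with rest-of-test total
        rest = [t - s for t, s in zip(totals, scores)]
        stats.append((p, _pearson(scores, rest)))
    return stats
```

Items with moderate difficulty and high positive discrimination separate strong from weak test-takers, which is what "maximizing discriminative power across agentic abilities" amounts to in CTT terms.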

Method

The authors formally define Agentic Document Collection Visual Question Answering as a task requiring systems to navigate, retrieve, reason over, and aggregate information from heterogeneous document collections. Given a corpus $\mathcal{C}$ of multi-page PDF documents and a natural language query $q$, the task is to produce an answer $a$ and a minimal evidence set $\mathcal{E}$ of pages. The framework operates through an iterative cycle of decomposition, retrieval, and analysis. Refer to the framework diagram for a visual representation of this process, which illustrates how a question is decomposed, relevant documents are retrieved from the corpus, and the information is analyzed to generate an attributed answer.
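The task's input/output contract can be sketched as a minimal typed interface; the class and field names below are illustrative, not taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class AgenticDocVQAInstance:
    """Input: a corpus C of multi-page PDFs and a natural language query q."""
    corpus: list[str]   # paths or identifiers of PDF documents
    query: str          # the question to answer

@dataclass
class AgenticDocVQAResult:
    """Output: an answer a and a minimal evidence set E of pages."""
    answer: str
    evidence: set[tuple[str, int]]  # (doc_id, page_number) pairs
```

Page-level evidence (rather than bounding boxes) matches the dataset's annotation granularity and the unit at which standard retrieval systems operate.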

This task is characterized by six formal properties that distinguish it from standard document QA. First, the task is extractive, meaning answer tokens are drawn directly from the evidence pages rather than generated abstractly. Second, it supports multi-hop reasoning where the evidence set may comprise multiple disjoint pages requiring aggregation. Third, it operates under a closed-world assumption, deriving answers solely from the corpus without relying on parametric world knowledge. Fourth, it requires grounded attribution, ensuring the answer is faithfully entailed by the minimal evidence set. Fifth, the task is agentic, necessitating iterative retrieval and planning that cannot be solved in a single forward pass. Sixth, it is visual, requiring comprehension of non-textual modalities such as spatial layout, table structure, and figures.

To address this task, the authors implement a search-augmented agent baseline that combines text-based retrieval with vision-language model (VLM) reasoning. The agent iteratively searches a document collection and analyzes retrieved page images. A full-text search index is constructed from OCR-extracted text using the Whoosh search library. The agent operates in a loop, equipped with a search_documents tool that returns rendered images of matching pages. This allows the agent to leverage the VLM's visual understanding for layout-sensitive documents. The agent produces structured outputs containing answer strings and citations.
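The retrieval side of this loop can be sketched as follows. To stay dependency-free, this sketch scores pages with a plain BM25 implementation rather than Whoosh (whose default scorer is the related BM25F); `BM25Index` and its method names are illustrative, not the authors' code.

```python
import math, re
from collections import Counter

class BM25Index:
    """Minimal full-text index over OCR-extracted page text.
    pages: list of (doc_id, page_no, text) triples."""
    def __init__(self, pages, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.meta = [(d, p) for d, p, _ in pages]
        self.docs = [Counter(self._tokens(t)) for _, _, t in pages]
        self.avgdl = sum(sum(d.values()) for d in self.docs) / len(self.docs)
        self.df = Counter(t for d in self.docs for t in set(d))
        self.N = len(self.docs)

    @staticmethod
    def _tokens(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def search(self, query, limit=5):
        """Tool body: return top (doc_id, page_no) hits; the agent would then
        analyze rendered images of these pages with a VLM."""
        scores = []
        for meta, doc in zip(self.meta, self.docs):
            dl = sum(doc.values())
            s = 0.0
            for t in self._tokens(query):
                if t not in doc:
                    continue
                idf = math.log(1 + (self.N - self.df[t] + 0.5) / (self.df[t] + 0.5))
                norm = doc[t] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
                s += idf * doc[t] * (self.k1 + 1) / norm
            if s > 0:
                scores.append((s, meta))
        return [m for _, m in sorted(scores, key=lambda x: -x[0])[:limit]]
```

Returning page identifiers (and, in the real system, rendered page images) rather than extracted text is what lets the VLM handle layout-sensitive documents such as tables and forms.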

Another baseline involves an agentic approach using the Claude Agents SDK integrated with semtools. This agent has access to composable Unix-style utilities for parsing, searching, and managing document workspaces. As shown in the figure below, the agent uses a specific user prompt and configuration to execute bash pipelines and interpret search results.

The authors also employ Recursive Language Models (RLMs) as a task-agnostic inference paradigm. This framework enables models to handle long contexts by programmatically examining and decomposing the input within a REPL environment. The document corpus is loaded as a variable, and the model can spawn recursive sub-LLM calls to process subsets of the context.
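The recursive decomposition pattern can be sketched as below, with `llm` standing in for any prompt-to-string model call; the splitting heuristic, character budget, and recursion depth are illustrative assumptions, not the RLM authors' implementation.

```python
def recursive_answer(llm, query, context, max_chars=4000, depth=0):
    """If the context fits the budget, answer directly; otherwise split it
    and spawn recursive sub-LLM calls whose partial answers are aggregated
    by a final call. `llm` is a hypothetical prompt -> str callable."""
    if len(context) <= max_chars or depth >= 3:
        return llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
    mid = len(context) // 2
    parts = [
        recursive_answer(llm, query, context[:mid], max_chars, depth + 1),
        recursive_answer(llm, query, context[mid:], max_chars, depth + 1),
    ]
    return llm(
        "Combine these partial answers into one answer.\n"
        f"Question: {query}\nPartials: {parts}\nAnswer:"
    )
```

In the actual RLM setup the model itself writes code in a REPL to slice the corpus variable; the fixed halving above is only a stand-in for that programmatic decomposition.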

To analyze the questions, the authors utilize classification prompts. One prompt classifies questions into categories such as yes_no, binary_choice, or other based on the answer type. As shown in the figure below, this classification helps determine the complexity and required reasoning steps.

Additionally, a question modality classifier determines whether visual modality is required. This prompt categorizes questions based on visual requirements, such as free text, table structure, chart interpretation, or spatial layout. As shown in the figure below, these definitions quantify the importance of layout understanding and visual artifacts.

Evaluation is performed using an LLM Judge prompt. This prompt evaluates answer correctness based on semantic equivalence to gold variants. The criteria distinguish between correct, partial, and incorrect answers. As shown in the figure below, the evaluation steps include checking for refusal, comparing content, and checking for critical errors like missing scale qualifiers.

The evaluation process continues with checks for format and verbosity. As shown in the figure below, the judge follows a step-by-step analysis to provide a final judgment, ensuring that answers are concise and adhere to the expected output format.

Further evaluation rules enforce strict formatting, such as returning answers as lists of short strings without full sentences. As shown in the figure below, the expected output format includes the answer, citations, and search history.

Finally, the authors analyze the efficiency of the retrieval process using the Kuiper Statistic. This metric measures the cumulative difference in performance relative to the effort expended. As shown in the figure below, the graph illustrates an efficient region where performance gains are high, followed by a region of diminishing returns.
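The classic Kuiper statistic between two cumulative curves sampled at the same points is the sum of the largest deviations in each direction. A minimal sketch follows; interpreting one curve as an agent's cumulative accuracy over effort and the other as a reference (e.g. uniform-gain) curve is an assumption about how the paper applies the statistic.

```python
def kuiper_statistic(curve, baseline):
    """V = max(curve - baseline) + max(baseline - curve), clipped at 0.
    Both inputs are cumulative values sampled at the same effort points."""
    d_plus = max(c - b for c, b in zip(curve, baseline))
    d_minus = max(b - c for c, b in zip(curve, baseline))
    return max(d_plus, 0.0) + max(d_minus, 0.0)
```

Unlike the Kolmogorov-Smirnov statistic, which keeps only the single largest deviation, Kuiper's V is sensitive to deviations in both directions, so it can capture an efficient early region followed by diminishing returns.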

Experiment

  • Construct validity experiments confirm that the benchmark requires semantic reasoning and visual comprehension rather than simple lexical matching or reliance on parametric knowledge, as keyword-based retrieval yields low precision and models can only guess a small fraction of answers without document evidence.
  • Visual analysis reveals that over half of the questions depend on understanding structured layouts, tables, or visual artifacts, demonstrating that text-only approaches are insufficient for the majority of tasks.
  • Evaluation of agentic systems shows that iterative planning significantly outperforms static retrieval methods, though retrieval remains the primary bottleneck, with top models still trailing human performance by a substantial margin even with perfect search tools.
  • Error decomposition indicates that while weaker models fail primarily due to retrieval issues or premature refusals, stronger models shift toward comprehension failures, suggesting that finding the correct document is becoming easier than extracting the precise answer.
  • Calibration studies demonstrate that humans allocate search effort more efficiently than current agents, who often expend excessive compute on difficult queries without recognizing when to stop, highlighting a gap in strategic reasoning and self-correction.
  • Multi-hop reasoning analysis finds that semantic distance between evidence sources is a stronger predictor of difficulty than physical page proximity, and cross-document questions are often easier than same-document ones due to clearer structural boundaries.
