Command Palette
Search for a command to run...
CiteVQA: Benchmarking der Evidenzzuschreibung für vertrauenswürdige Dokumentenintelligenz
CiteVQA: Benchmarking der Evidenzzuschreibung für vertrauenswürdige Dokumentenintelligenz
Zusammenfassung
Titel:Abstract: Multimodale Large Language Models (MLLMs) haben das Dokumentenverständnis erheblich vorangetrieben, doch aktuelle Doc-VQA-Evaluierungen bewerten ausschließlich die Endergebnisantwort und lassen die stützenden Belege unberücksichtigt. Dieser rein auf die Antwort fokussierte Ansatz verschleiert eine kritische Fehlerart: Ein Modell kann zur korrekten Antwort gelangen, diese jedoch auf einen falschen Textabschnitt stützen – ein kritisches Risiko in hochsensiblen Bereichen wie Recht, Finanzen und Medizin, in denen jede Schlussfolgerung auf eine spezifische Quellregion zurückverfolgbar sein muss. Um diesem Problem zu begegnen, führen wir CiteVQA ein, einen Benchmark, der Modelle dazu verpflichtet, zu jeder Antwort elementgenaue Bounding-Box-Zitate zurückzugeben und dabei Antwort und Zitat gemeinsam zu bewerten. CiteVQA umfasst 1.897 Fragen zu 711 PDF-Dokumenten, die sich auf sieben Domänen und zwei Sprachen erstrecken, wobei durchschnittlich 40,6 Seiten pro Dokument vorliegen. Um die Genauigkeit und Skalierbarkeit zu gewährleisten, werden die Ground-Truth-Zitate durch eine automatisierte Pipeline generiert, die entscheidende Belege durch Maskierungsablation identifiziert, und anschließend durch Expertenreview validiert. Im Kern unserer Evaluierung steht die Strict Attributed Accuracy (SAA), die eine Vorhersage nur dann als korrekt wertet, wenn sowohl die Antwort als auch die zitierte Region richtig sind. Die Überprüfung von 20 MLLMs offenbart eine weit verbreitete Attribution Hallucination: Modelle produzieren häufig die richtige Antwort, zitieren jedoch die falsche Region. Das stärkste System (Gemini-3.1-Pro-Preview) erreicht eine SAA von lediglich 76,0, und das stärkste Open-Source-MLLM kommt gerade einmal auf 22,5. Letztlich deckt CiteVQA, im Bestreben nach vertrauenswürdiger Dokumentenintelligenz, eine Zuverlässigkeitslücke auf, die von rein auf Antworten basierenden Evaluierungen übersehen wird, und liefert die notwendigen Instrumente, um diese zu schließen. Unser Repository ist unter https://github.com/opendatalab/CiteVQA verfügbar.
One-sentence Summary
Contrasting with prior answer-only evaluations, CiteVQA advances trustworthy document intelligence by jointly assessing final answers and element-level bounding-box citations via Strict Attributed Accuracy, thereby exposing pervasive attribution hallucinations and providing rigorous instrumentation for high-stakes domains.
Key Contributions
- The paper introduces CiteVQA, a benchmark requiring multimodal models to return element-level bounding-box citations alongside final answers. The dataset comprises 1,897 questions across 711 multi-page PDFs spanning seven domains and two languages, with ground-truth citations generated via an automated masking ablation pipeline and validated by expert review.
- The work establishes Strict Attributed Accuracy (SAA), a metric that credits a prediction only when both the textual answer and the cited visual region are correct. This evaluation protocol enforces joint verification to overcome the reliability gaps inherent in conventional answer-only scoring.
- An audit of 20 multimodal large language models identifies a pervasive Attribution Hallucination phenomenon where systems frequently cite incorrect document regions despite producing correct answers. The baseline results show that the strongest closed-source system achieves an SAA of 76.0, while the top open-source model reaches 22.5.
Introduction
Document Visual Question Answering and evidence-based reasoning have become essential for high-stakes domains like healthcare and law, where preventing LLM hallucinations and ensuring verifiable information extraction are critical. Prior benchmarks, however, remain largely answer-centric and rely on coarse page-level annotations or inconsistent bounding box granularity without standardized evaluation protocols. Existing document intelligence systems also struggle with precise element-level grounding, while current metrics fail to verify reasoning paths or visual traceability in complex, multi-domain layouts. To address these gaps, the authors introduce CiteVQA, a cross-page framework that standardizes element-level bounding box citations and implements joint evaluation metrics. This approach uniquely measures both answer accuracy and structural traceability, enabling rigorous auditing of model reasoning against precise visual evidence in real-world documents.
Dataset
Dataset Composition and Sources
- The authors introduce CiteVQA, a benchmark comprising 1,897 questions derived from 711 PDF documents spanning seven domains and 30 sub-categories across two languages.
- Documents average 40.6 pages each and are sourced from Common Crawl, selected through a stratified sampling pipeline that filters over 100 million raw PDFs based on domain and language distribution.
- The dataset balances single-document tasks (52.0%) with multi-document scenarios, including cases with one gold document (25.7%) and multiple gold documents (22.3%).
- Each question requires an average of 2.57 evidence elements, with approximately 30% of evidence consisting of non-textual content such as tables, images, or equations.
Key Details and Subsets
- The benchmark covers diverse reasoning types ranging from complex synthesis to multimodal parsing, ensuring broad domain representation.
- Evidence is uniformly distributed across document positions and frequently spans multiple pages, requiring robust long-context aggregation capabilities.
- The dataset includes questions distilled from various open-source sources, processed through template generation to simulate real-world business scenarios.
- Human expert audits validate a subset of 200 instances, confirming appropriate question difficulty and high annotation quality.
Data Processing and Construction
- Construction relies on an automated pipeline that performs multi-document linking via semantic alignment and LLM-based metadata integration.
- Deep parsing utilizes MinerU2.5 to extract bounding box coordinates and OCR content, while MLLM agents navigate the parsed space to aggregate supporting facts into evidence packages.
- QA pairs are synthesized using template-driven distillation, where MLLMs select logical templates and generate questions based on evidence characteristics.
- Quality control includes answerability verification to ensure evidence sufficiency, paraphrasing for linguistic diversity, and a zero-document self-test to discard common-knowledge questions.
- Crucial evidence is identified through ablation-based masking, where elements are individually masked to verify their necessity for deriving the correct answer.
Usage and Evaluation Strategy
- The authors use CiteVQA as a rigorous evaluation benchmark rather than a training set, auditing 20 mainstream multimodal models.
- Evaluation centers on Strict Attributed Accuracy, which credits predictions only when both the answer and the cited region are correct.
- Additional metrics assess evidence coverage via Recall and logical alignment via Relevance to diagnose model behavior.
- The benchmark exposes a pervasive attribution hallucination phenomenon, where models produce correct answers grounded in incorrect evidence, with state-of-the-art models capping at 76.0 SAA.
Metadata and Cropping Specifications
- Metadata includes structured spatial coordinates and document identifiers, with bounding box coordinates provided as relative values ranging from 0 to 1000 on the page image.
- Page numbers in the metadata are indexed from 1, ignoring original page numbers from the source documents.
- Citation rules enforce element-level granularity, requiring evidence to correspond to complete paragraphs, tables, images, or notes rather than partial text or rows.
- Captions and footnotes for tables and images are annotated as separate evidence elements with distinct bounding boxes to ensure precise visual grounding.
- The output format requires bounding box tags to accompany cited evidence, enabling direct verification of the visual source for every claim.
Method
The framework for the CiteVQA system is composed of four primary stages: multi-document linking, evidence package extraction, QA construction, and quality control. The overall process begins with multi-document linking, where a filtered document pool undergoes semantic aggregation to form a linked document group. This stage leverages a semantic profiling mechanism to generate high-level descriptors for each document, which are then encoded into normalized vectors. For an anchor document, the top-K candidate documents are selected based on cosine similarity, forming a candidate pool that ensures only contextually relevant documents proceed to fine-grained analysis.
As shown in the figure below, the fine-grained alignment process employs a large language model (LLM) to perform chain-of-thought reasoning across section units from both the anchor and candidate documents. The model identifies logical bridges between documents by analyzing their structural hierarchy and outputs structured association groups, each containing an anchor section, a candidate section, a similarity score, and a rationale. The system retains the top matches based on scores and filters out unreliable associations, ensuring high information density and reducing noise.
The second stage, evidence package extraction, involves parsing documents to collect high-quality, verifiable evidence bundles. This is achieved through a multi-step process that includes document parsing and agent exploration. The system extracts OCR text, bounding boxes, and logical relations to form evidence packages. Each package must satisfy specific criteria: it must span at least two pages, include at least two element types (such as text, tables, figures, or layout), and provide complete context for any extracted elements. The output is a list of evidence bundles, each containing a description and a collection of relevant elements.
In the QA construction phase, question collection and template distillation are performed to synthesize QA pairs. The system uses templates derived from the collected questions to generate structured QA pairs, ensuring that the generated answers are grounded in the extracted evidence. The final stage, quality control, involves QA verification and paraphrasing to ensure the accuracy and coherence of the generated responses. This includes evidence ablation to assess the impact of crucial evidence and to ensure that the generated answers are not overly reliant on non-essential information.
The framework is designed to maintain a balance between preserving fine-grained document details and adhering to the architectural limits of diverse model families. The input resolution is standardized to 1024×1024 pixels, which represents a critical saturation point for most current multimodal large language models (MLLMs). This resolution ensures that precise localization is maintained while avoiding the limitations imposed by context constraints. The inference settings are unified across experiments, with a maximum output length of 4,096 tokens and the use of specific model configurations to maximize reasoning capability. The deployment infrastructure utilizes 8×NVIDIA H200 GPUs to ensure consistent latency and sufficient VRAM for high-resolution document processing.
Experiment
The evaluation assesses twenty advanced multimodal language models on the CiteVQA benchmark to validate their capacity for accurate question answering alongside trustworthy spatial grounding and evidence attribution across diverse document formats. The experiments reveal a pervasive attribution hallucination where models frequently produce correct answers but fail to precisely locate or cite the supporting evidence, with proprietary systems significantly outperforming open alternatives that struggle with basic page navigation. Performance deteriorates sharply in cross-document and complex layout scenarios, yet the strong positive correlation between evidence quality and answer accuracy indicates that enhancing autonomous spatial localization is fundamental to improving both reasoning capabilities and reliability in professional applications.
The authors evaluate evidence attribution in multimodal language models using a set of metrics that assess both answer correctness and grounding quality. Results show a significant gap between answer accuracy and strict attributed accuracy across all models, indicating a pervasive issue where models can generate correct answers without correctly linking them to supporting evidence. Performance varies widely by model type, with closed-source models outperforming open-source ones, and the difficulty of attribution increases substantially in multi-document settings. Models often achieve high answer accuracy but fail to properly ground their responses in specific evidence, a phenomenon referred to as 'Attribution Hallucination'. Closed-source models significantly outperform open-source models in evidence attribution, with a substantial performance gap observed across all metrics. Attribution becomes markedly harder in multi-document scenarios, where even top models show significant drops in localization and recall performance.
The experiment evaluates multimodal large language models on evidence attribution tasks using a dataset with diverse document types, question types, and evidence sources. Results show significant performance gaps between models, particularly in linking answers to correct document locations, with many models failing to locate relevant pages or accurately cite evidence despite generating correct answers. Models often fail to locate the correct document pages, indicating a fundamental challenge in coarse-grained attribution. A discrepancy exists between answer correctness and evidence attribution, with many models achieving high answer accuracy but low attribution scores. Performance varies significantly by question type, with quantitative reasoning tasks being easier than multimodal parsing, which requires precise evidence localization.
The the the table presents a comprehensive evaluation of various multimodal large language models across different document scenarios, highlighting significant performance disparities between closed-source and open-source models. Results show that closed-source models generally outperform open-source models in evidence attribution, with a notable gap in strict attributed accuracy, indicating a common issue of attribution hallucination where models provide correct answers but fail to ground them properly. Performance degrades substantially in multi-document settings compared to single-document tasks, particularly for open-source models, and the ability to locate the correct page is a major bottleneck across all model categories. Closed-source models significantly outperform open-source models in evidence attribution, especially in multi-document scenarios. A widespread gap exists between answer correctness and strict attributed accuracy, indicating a common issue of attribution hallucination. Locating the correct document page is a major challenge, with performance dropping sharply in multi-document settings across all models.
The authors evaluate the performance of models on evidence attribution tasks using automated judges and compare their scores against human expert ratings. Results show that automated judges produce scores that are statistically indistinguishable from human evaluations across both relevance and answer correctness metrics, indicating the reliability of the automated evaluation pipeline. The analysis further reveals that models exhibit varying levels of performance, with some achieving high answer correctness but lower relevance scores, suggesting a discrepancy between accurate answers and faithful evidence grounding. Automated judges produce scores that are not statistically different from human expert ratings, validating the reliability of the evaluation method. Models show a performance gap between answer correctness and relevance, indicating a disconnect between generating correct answers and providing well-grounded evidence. GPT-5.4 and Gemini-3.1-Pro achieve high answer correctness scores but differ in relevance, highlighting varying strengths in evidence attribution.
The authors evaluate evidence attribution in multimodal large language models using a comprehensive set of metrics that assess both answer correctness and grounding quality. Results show a significant gap between answer accuracy and strict attributed accuracy across all models, indicating a pervasive issue where models can generate correct answers without properly linking them to the supporting evidence. Performance varies widely by model type, with closed-source models outperforming open-source ones, and the task becomes substantially harder in multi-document settings due to challenges in both page-level navigation and precise evidence localization. Models often achieve high answer correctness but fail to attribute evidence correctly, indicating a widespread attribution hallucination problem. Closed-source models significantly outperform open-source models in evidence attribution, with a notable performance gap in strict attributed accuracy. Multi-document scenarios drastically reduce performance, particularly in page-level recall and evidence localization, highlighting challenges in cross-document reasoning.
The experiments evaluate multimodal large language models on evidence attribution tasks using diverse document types and question formats, with automated scoring validated against human expert ratings to ensure reliability. Results consistently reveal a pronounced disconnect between answer correctness and strict evidence grounding, highlighting a widespread phenomenon where models generate accurate responses without properly citing supporting material. While closed-source architectures generally surpass open-source counterparts, performance degrades substantially in multi-document environments, underscoring significant challenges in cross-document navigation and precise localization.