
CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era

Zhengqing Yuan, Kaiwen Shi, Zheyuan Zhang, Lichao Sun, Nitesh V. Chawla, Yanfang Ye

Abstract

Scientific research depends on accurate citation to ensure proper attribution and the integrity of the work, but large language models (LLMs) introduce a new risk: fictitious references that look credible yet correspond to no real publication. Such hallucinated citations have already been observed in submissions and accepted papers at major machine learning conferences, exposing vulnerabilities in the peer-review process. Meanwhile, the rapid growth of reference lists makes manual verification impractical, and existing automated tools remain brittle in the face of noisy, heterogeneous citation formats while lacking standardized evaluation criteria. We present the first comprehensive benchmark and detection infrastructure for hallucinated citations in scientific writing. Our multi-agent verification pipeline decomposes citation checking into several stages (claim extraction, evidence retrieval, passage matching, reasoning, and calibrated judgment) to assess whether a cited source actually supports the stated assertion. We curate a large human-validated dataset spanning multiple domains and define unified metrics for measuring citation faithfulness and evidence alignment. Experiments with today's strongest language models reveal substantial citation errors and show that our framework significantly outperforms prior methods in both accuracy and interpretability. This work provides the first scalable infrastructure for auditing citations in the LLM era, offering concrete tools to strengthen the reliability of scientific references.

One-sentence Summary

Researchers from Notre Dame and Lehigh introduce APM, a multi-agent framework detecting hallucinated citations via staged verification, outperforming baselines with interpretable, scalable auditing for LLM-generated scientific texts, addressing integrity risks in peer review and scholarly publishing.

Key Contributions

  • We introduce the first large-scale, human-validated benchmark for detecting hallucinated citations, covering diverse domains and citation types with standardized evaluation protocols to enable reproducible research on citation faithfulness.
  • We propose a multi-agent verification framework that decomposes citation checking into coordinated stages—claim extraction, evidence retrieval, passage matching, contextual reasoning, and calibrated judgment—to handle noisy, real-world citation formats robustly.
  • Experiments on state-of-the-art LLMs reveal widespread citation errors, and our framework significantly outperforms existing baselines in both detection accuracy and interpretability, offering scalable tools for reviewers and publishers.

Introduction

The authors confront a growing threat that has accompanied the rise of large language models in scientific writing: hallucinated citations, plausible but nonexistent references that undermine research integrity. Prior automated tools struggle with real-world citation noise and lack standardized benchmarks, while manual verification is no longer feasible at scale. Their main contribution is a multi-agent verification framework that breaks citation checking into specialized stages (claim extraction, evidence retrieval, matching, reasoning, and judgment), paired with the first large-scale, human-validated benchmark spanning diverse citation types and domains. This system outperforms existing baselines in accuracy and interpretability, offering a practical, scalable infrastructure for researchers and publishers to audit citations and restore trust in scholarly references.

Dataset

The authors use CiteAudit, a benchmark for citation hallucination detection, composed of two main components: real-world citations and human-synthesized hallucinated citations. Here’s how the dataset is built and used:

  • Sources and Composition:

    • Real citations are drawn from OpenReview and Google Scholar, manually verified against authoritative records (title, authors, venue, year, DOI).
    • Hallucinated citations are generated from verified BibTeX entries using controlled perturbations guided by a taxonomy of hallucination types (title, author, metadata errors).
  • Key Subset Details:

    • Real-world set: High-quality, naturally occurring errors (e.g., wrong authors, venues, nonexistent papers), manually verified but limited in scale due to labor intensity.
    • Generated set (2,500 instances):
      • Title Errors (1,000): Generated via keyword substitution, paraphrasing, or GPT-4o-mini topic-conditioned synthesis.
      • Author Errors (1,000): Via adding/deleting authors, name swaps, or full fabrication.
      • Metadata Errors (500): Venue mismatches, year shifts, or fake DOIs.
    • Compound hallucinations (multiple field errors) are always labeled as hallucinated.
  • Processing and Annotation:

    • All citations undergo automated web retrieval followed by manual cross-checking by the author team.
    • Labeling is strict: only core metadata (title, authors, venue, year, DOI) must match authoritative sources; minor formatting issues are ignored.
    • Each citation is reviewed by at least two authors; unresolved cases are excluded. A random subset is audited for consistency.
  • Dataset Use in Evaluation:

    • The generated test set (5,000 total: 2,500 real + 2,500 hallucinated) is balanced across error types to enable fair, fine-grained model evaluation.
    • Generated hallucinations are validated to match real-world error distributions (chi-square test: p ≈ 0.97).
    • No automated or crowd-sourced labeling is used — all annotations are author-verified for high confidence.
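The distribution check mentioned above (generated hallucinations matching real-world error distributions) can be sketched with a hand-rolled Pearson chi-square statistic. The real-world tallies below are hypothetical placeholders; the paper's reported p ≈ 0.97 comes from its actual data.

```python
# Pearson chi-square statistic comparing error-type counts in the generated
# set against a (hypothetical) real-world tally. A small statistic means the
# two distributions are close.

def chi_square_stat(observed: list, expected: list) -> float:
    total_obs, total_exp = sum(observed), sum(expected)
    # Scale expected counts to the observed total before comparing.
    scaled = [e * total_obs / total_exp for e in expected]
    return sum((o - e) ** 2 / e for o, e in zip(observed, scaled))

generated = [1000, 1000, 500]   # title / author / metadata errors (from the text)
real_world = [41, 39, 20]       # hypothetical real-world error tallies

print(round(chi_square_stat(generated, real_world), 2))  # → 1.25
```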

This design supports reproducible, controlled evaluation of citation verification systems under both naturally occurring and systematically perturbed hallucination scenarios.
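The controlled-perturbation idea behind the generated set can be sketched as below. The concrete operators and field names are illustrative assumptions, not the authors' implementation; only the taxonomy (author errors, metadata errors) comes from the text above.

```python
import random

# Illustrative perturbation operators that turn a verified BibTeX-style entry
# into a hallucinated one, following the error taxonomy described above.

def perturb_authors(entry: dict, rng: random.Random) -> dict:
    """Author-error hallucination: delete, swap, or fabricate a name."""
    fake = dict(entry)
    authors = list(entry["author"])
    op = rng.choice(["delete", "swap", "fabricate"])
    if op == "delete" and len(authors) > 1:
        authors.pop(rng.randrange(len(authors)))
    elif op == "swap" and len(authors) > 1:
        i, j = rng.sample(range(len(authors)), 2)
        authors[i], authors[j] = authors[j], authors[i]
    else:
        authors.append("A. Fabricated")
    fake["author"], fake["label"] = authors, "hallucinated"
    return fake

def perturb_metadata(entry: dict, rng: random.Random) -> dict:
    """Metadata-error hallucination: shift the year or swap the venue."""
    fake = dict(entry)
    if rng.random() < 0.5:
        fake["year"] = entry["year"] + rng.choice([-2, -1, 1, 2])
    else:
        fake["venue"] = "NeurIPS" if entry["venue"] != "NeurIPS" else "ICML"
    fake["label"] = "hallucinated"
    return fake

rng = random.Random(0)
real = {"title": "Some Verified Paper", "author": ["J. Doe", "R. Roe"],
        "venue": "ICML", "year": 2023, "label": "real"}
print(perturb_authors(real, rng))
print(perturb_metadata(real, rng))
```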

Method

The authors leverage a decentralized multi-agent framework governed by a hierarchical Standardized Operating Procedure (SOP) to systematically audit scholarly citations for existence and metadata integrity. The system treats citation verification as a multi-stage evidence validation problem, where each citation $r_i$ is evaluated against a strict consistency criterion $S_c$ defined over its structured metadata tuple $M_i = \{m_T, m_A, m_U, m_V\}$, representing title, authors, URL, and venue. A citation is classified as Fake if no corresponding entry exists in the global scholarly graph $\mathcal{G}_{\text{scholar}}$ or if $S_c = 0$, where $S_c$ is computed as the product of indicator functions over exact character-level matches between extracted and ground-truth fields:

$$S_c = \prod_{k \in \{T, A, U, V\}} \mathbb{I}(m_k = \hat{m}_k)$$
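Concretely, the product of indicators reduces to an all-fields exact match: a single mismatch zeroes $S_c$. A minimal sketch, with illustrative sample records:

```python
# Strict consistency criterion S_c: character-level exact match over
# {title, authors, url, venue}. Returns 1 only if every field matches.

FIELDS = ("title", "authors", "url", "venue")

def consistency_score(extracted: dict, ground_truth: dict) -> int:
    """Return S_c in {0, 1}: 1 iff every field matches exactly."""
    return int(all(extracted[k] == ground_truth[k] for k in FIELDS))

cited = {"title": "A Real Paper", "authors": "J. Doe and R. Roe",
         "url": "https://example.org/p1", "venue": "NeurIPS"}
truth = dict(cited, venue="NIPS")  # single venue mismatch

print(consistency_score(cited, cited))  # 1
print(consistency_score(cited, truth))  # 0: one mismatch zeroes the product
```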

The framework is orchestrated by an LLM Controller acting as the SOP Executor, which decomposes the verification task into a sequential and parallelizable graph based on predefined stages (S1–S4). The pipeline begins with the Extractor Agent ($\mathcal{A}_{ext}$), which ingests raw PDFs using vision-integrated OCR tools and maps unstructured citation strings into immutable JSON metadata, preserving the original authorial intent without semantic distortion. This structured output serves as the substrate for all downstream verification.

Refer to the framework diagram for a visual representation of the agent interactions and decision flow. The Memory Agent ($\mathcal{A}_{mem}$) then performs a high-speed semantic lookup against a dual-end knowledge base $\mathcal{K}$, computing a confidence score $s_{mem}$ via cosine similarity between embeddings of the citation and stored records:

$$s_{mem}(M_i) = \max_{k \in \mathcal{K}} \frac{Enc(M_i) \cdot Enc(k)}{\|Enc(M_i)\| \, \|Enc(k)\|}$$

If $s_{mem} > \tau$ (with $\tau = 0.92$), the citation is immediately verified via the “fast-path” and stored in the Memory Pool for future reuse. Otherwise, the task is allocated to the Web Search Agent ($\mathcal{A}_{web}$), which interfaces with the Google Search API to retrieve and deep-crawl the full content of the top five results, ensuring evidence is grounded in actual textual data rather than snippets.
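The fast-path can be sketched as a max-cosine-similarity lookup against stored embeddings. The toy vectors below stand in for $Enc(\cdot)$, which in the real system is a learned embedding model (an assumption here):

```python
import numpy as np

# Memory Agent fast-path: take the max cosine similarity between the
# citation embedding and all stored records; short-circuit verification
# when it clears the threshold tau = 0.92.

TAU = 0.92

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def memory_lookup(query_vec: np.ndarray, memory: list) -> tuple:
    """Return (s_mem, fast_path_hit) over the stored embeddings."""
    s_mem = max(cosine(query_vec, k) for k in memory)
    return s_mem, s_mem > TAU

# Toy vectors standing in for Enc(M_i) and Enc(k).
memory = [np.array([1.0, 0.0, 0.0]), np.array([0.6, 0.8, 0.0])]
query = np.array([0.99, 0.02, 0.0])

s, hit = memory_lookup(query, memory)
print(round(s, 3), hit)  # near-duplicate citation -> fast-path verification
```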

The Judge Agent ($\mathcal{A}_{jud}$) then evaluates the alignment between the extracted metadata $M_i$ and the retrieved evidence $\mathcal{E}$ using the strict verification function:

$$\mathcal{F}_{judge}(M_i, \mathcal{E}) = \prod_{f \in \{T, A, U, V\}} \mathbb{I}(\mathrm{ExactMatch}(M_i^f, \mathcal{E}))$$

A successful match triggers storage in the Memory Pool; a mismatch escalates the citation to the Scholar Agent ($\mathcal{A}_{sch}$), which performs low-frequency, high-precision crawling of authoritative repositories (e.g., Google Scholar) to retrieve the canonical ground-truth record $\hat{M}_i$. The Judge Agent performs a final character-level alignment against this record. If this definitive check fails, the citation is flagged as Hallucinated with a provenance report; otherwise, it is verified and stored.
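The judge-and-escalate logic can be sketched as follows. The helper names and the flat-dict evidence format are assumptions for illustration:

```python
# F_judge applies an exact match over all fields against retrieved evidence;
# on mismatch, the Scholar Agent's canonical record (if any) gets the final say.

FIELDS = ("title", "authors", "url", "venue")

def judge(extracted: dict, evidence: dict) -> bool:
    """F_judge: True iff every field exactly matches the evidence."""
    return all(extracted[f] == evidence.get(f) for f in FIELDS)

def verify(citation: dict, web_evidence: dict, canonical=None) -> str:
    if judge(citation, web_evidence):
        return "verified"        # match: store in Memory Pool
    if canonical is not None and judge(citation, canonical):
        return "verified"        # Scholar Agent recovered the record
    return "hallucinated"        # definitive check failed: flag with report

cite = {"title": "T", "authors": "A", "url": "U", "venue": "V"}
print(verify(cite, cite))                       # verified via web evidence
print(verify(cite, dict(cite, venue="X")))      # escalates, then flagged
```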

The entire pipeline is implemented using the Qwen3-VL-235B A22 model deployed via vLLM, with agents instantiated for specific roles: the Planning Model for task routing, $\mathcal{A}_{ext}$ for OCR and schema mapping, $\mathcal{A}_{mem}$ based on the Mem0 framework for persistent knowledge retention, $\mathcal{A}_{web}$ for deep web crawling, $\mathcal{A}_{jud}$ for strict matching, and $\mathcal{A}_{sch}$ for authoritative retrieval. The system operates on a multi-thread pool (size = 4) with deterministic inference (temperature = 0.0) and employs a structured prompt-based SOP to ensure reproducible, auditable decision-making. Role separation between the Planning Agent (task routing only) and Judge Agent (strict matching only) enforces determinism and prevents heuristic conflation.
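Under the stated settings (thread pool of size 4, temperature 0.0), the execution harness might look like this minimal sketch. The per-citation worker is a placeholder, and the sampling-payload shape is an assumption about a typical vLLM-style deployment:

```python
from concurrent.futures import ThreadPoolExecutor

# Deterministic, parallel execution harness: 4 worker threads, greedy
# decoding (temperature 0.0) for reproducible, auditable decisions.

SAMPLING = {"model": "Qwen3-VL-235B", "temperature": 0.0}

def verify_reference(ref_id: int) -> str:
    # Placeholder for one full agent-pipeline run over a single citation.
    return f"ref-{ref_id}: temperature={SAMPLING['temperature']}"

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(verify_reference, range(8)))

print(results[0])
```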

Experiment

  • Our model excels in detecting hallucinated citations without falsely rejecting real ones, outperforming existing systems by enforcing strict authenticity over permissive plausibility.
  • On real-world benchmarks, it achieves the highest accuracy, precision, recall, and F1 score, significantly surpassing alternatives and proving robust under noisy, ambiguous conditions.
  • Ablation studies confirm critical roles: the Scholar Agent acts as a final safety net for authoritative validation, the LLM Judge enables semantic resilience against formatting noise, and the Web Search Agent ensures efficiency via fast-path filtering.
  • Proprietary LLMs underperform due to opaque, unreliable retrieval behavior, highlighting the need for transparent, evidence-grounded verification tools.
  • Case studies demonstrate fine-grained detection of metadata mismatches (e.g., title, author), offering interpretable, structured verification beyond binary classification, crucial for scholarly integrity.

The authors use a generated benchmark and real-world test set to evaluate citation verification models, revealing that most existing systems struggle to balance detecting hallucinated citations with preserving genuine ones. Their proposed model achieves near-perfect detection of fabricated references while maintaining low false positives on real citations, outperforming baselines in both accuracy and cost efficiency. Ablation studies confirm that each component—web search, semantic reasoning, and authoritative verification—plays a critical role in achieving this robust and efficient performance.

The authors evaluate their citation verification model against several baselines on a generated benchmark, showing it achieves near-perfect recall with zero false negatives and maintains high precision, outperforming others in balancing hallucination detection and real citation preservation. Their model also incurs no API cost and processes references as quickly as the most efficient baselines, leveraging lightweight agents and external tools rather than relying on expensive LLM inference. Results confirm the system’s ability to enforce strict authenticity constraints while remaining cost-effective and scalable.

The authors evaluate their citation verification framework against ablated variants, showing that removing the Scholar Agent significantly reduces recall, while replacing the LLM Judge with string matching drastically lowers precision and F1 score. Disabling the Web Search Agent increases processing time eightfold, confirming its role in maintaining efficiency. These results demonstrate that each component contributes uniquely to the system’s balanced accuracy, reliability, and speed.


