
A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces

Mingxuan Du, Benfeng Xu, Chiwei Zhu, Shaohan Wang, Pengyu Wang, Xiaorui Wang, Zhendong Mao

Abstract

Frontier language models have demonstrated strong reasoning and long-horizon tool-use capabilities. Existing RAG systems, however, do not exploit these capabilities: they still follow one of two paradigms, (1) designing an algorithm that retrieves passages in a single step and then concatenates them into the model's input, or (2) predefining a workflow structure that the model is guided to execute step by step. Neither paradigm lets the model participate in retrieval decisions, which prevents efficient scaling as models improve. In this paper we introduce A-RAG, an agentic RAG framework that exposes hierarchical retrieval interfaces directly to the model. A-RAG provides three retrieval tools, keyword search, semantic search, and chunk read, allowing the agent to adaptively search for and retrieve information across multiple granularities. Experiments on several open-domain QA benchmarks show that A-RAG consistently outperforms existing approaches while using a comparable or smaller number of retrieved tokens, demonstrating that it effectively exploits the model's capabilities and adapts dynamically to different RAG tasks. We also systematically study how A-RAG scales with model size and test-time compute. We release our code and evaluation suite to support future research; both are available at: https://github.com/Ayanami0730/arag.

One-sentence Summary

Researchers from USTC and Metastone Technology propose A-RAG, an agentic RAG framework with hierarchical retrieval tools (keyword_search, semantic_search, chunk_read) that enable LLMs to autonomously adapt search strategies across granularities, outperforming prior methods on multi-hop QA benchmarks while using fewer tokens and scaling efficiently with model size and test-time compute.

Key Contributions

  • A-RAG addresses the limitation of static RAG systems by enabling LLMs to dynamically control retrieval through hierarchical tools—keyword_search, semantic_search, and chunk_read—allowing adaptive, multi-granularity information gathering aligned with the model’s reasoning process.
  • Evaluated on open-domain QA benchmarks, A-RAG consistently outperforms prior methods while using comparable or fewer retrieved tokens, validating that agent-driven retrieval leverages model capabilities more effectively than predefined workflows or single-shot retrieval.
  • Systematic scaling experiments show A-RAG’s performance improves with larger models and increased test-time compute, demonstrating its efficiency in scaling alongside advances in LLM capabilities and computational resources.

Introduction

The authors leverage the growing reasoning and tool-use capabilities of frontier LLMs to rethink Retrieval-Augmented Generation (RAG), which has historically relied on static, algorithm-driven retrieval or rigid, predefined workflows that limit model autonomy. Prior approaches—including Graph RAG and Workflow RAG—fail to let the model dynamically adapt its retrieval strategy based on context or task complexity, preventing efficient scaling with model improvements. The authors’ main contribution is A-RAG, an agentic framework that exposes hierarchical retrieval tools (keyword_search, semantic_search, chunk_read) directly to the model, enabling it to autonomously navigate information at multiple granularities. Experiments show A-RAG outperforms existing methods with fewer retrieved tokens and scales effectively with model size and test-time compute, proving that agent-driven retrieval interfaces are more powerful than fixed retrieval algorithms.

Dataset

  • The authors use only publicly available benchmarks previously curated and processed by prior research, ensuring ethical compliance.
  • No new data is collected, and no human subjects are involved in this work.
  • The focus is on advancing retrieval-augmented generation (RAG) in large language models, with no added ethical risks beyond those already present in the base models.
  • Dataset composition and processing follow established practices from prior studies, without introducing novel filtering, cropping, or metadata construction steps.

Method

The authors leverage a minimalist yet powerful agent-centric architecture called A-RAG, which exposes hierarchical retrieval interfaces to enable autonomous, iterative information gathering. The framework is built around three core components: a lightweight hierarchical index, a suite of granular retrieval tools, and a simple ReAct-style agent loop that facilitates dynamic strategy selection and interleaved tool use.

The hierarchical index is constructed in two stages: chunking and embedding. The corpus is partitioned into approximately 1,000-token chunks aligned with sentence boundaries to preserve semantic coherence. Each chunk is then decomposed into constituent sentences, and each sentence $s_{i,j}$ is embedded using a pre-trained encoder $f_{\text{emb}}$, yielding vector representations $\mathbf{v}_{i,j} = f_{\text{emb}}(s_{i,j})$. This sentence-level embedding enables fine-grained semantic matching while preserving the mapping to parent chunks. Crucially, keyword-level retrieval is handled at query time via exact text matching, avoiding costly offline indexing. This yields a three-tiered representation: implicit keyword-level for precise entity matching, sentence-level for semantic search, and chunk-level for full content access.
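
A minimal sketch of this two-stage construction is shown below, assuming a sentence-transformers encoder as $f_{\text{emb}}$ and a naive whitespace token count; the model name and helper structure are illustrative, not the authors' exact implementation:

```python
# Sketch of the two-stage hierarchical index (chunking + sentence embedding).
from dataclasses import dataclass, field
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed encoder f_emb

@dataclass
class Chunk:
    chunk_id: int
    text: str
    sentences: list[str] = field(default_factory=list)

def build_index(corpus_sentences: list[str], max_tokens: int = 1000):
    """Group sentences into ~1,000-token chunks, then embed each sentence."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any f_emb works here
    chunks, current, budget = [], [], 0
    for sent in corpus_sentences:
        n_tokens = len(sent.split())  # crude token count for illustration
        if current and budget + n_tokens > max_tokens:
            chunks.append(Chunk(len(chunks), " ".join(current), current))
            current, budget = [], 0
        current.append(sent)
        budget += n_tokens
    if current:
        chunks.append(Chunk(len(chunks), " ".join(current), current))
    # Sentence-level embeddings, each keeping a pointer to its parent chunk.
    sents = [(c.chunk_id, s) for c in chunks for s in c.sentences]
    vecs = encoder.encode([s for _, s in sents], normalize_embeddings=True)
    return chunks, sents, np.asarray(vecs), encoder
```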

To interface with this index, the authors design three retrieval tools operating at different granularities. The keyword search tool accepts a list of keywords $\mathcal{K} = \{k_1, k_2, \ldots, k_m\}$ and returns the top-$k$ chunks ranked by a weighted frequency score:

$$\mathrm{Score}_{\mathrm{kw}}(c_i, \mathcal{K}) = \sum_{k \in \mathcal{K}} \mathrm{count}(k, T_i) \cdot |k|$$

where $T_i$ denotes the text of chunk $c_i$, and longer keywords are weighted higher for specificity. For each matched chunk, the tool returns an abbreviated snippet containing only the sentences that include at least one keyword:

$$\mathrm{Snippet}(c_i, \mathcal{K}) = \{\, s \in \mathrm{Sent}(c_i) \mid \exists k \in \mathcal{K},\ k \subseteq s \,\}$$
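
A straightforward implementation of this scoring and snippet extraction might look as follows, reusing the `Chunk` objects from the index sketch above; the function signature and return format are illustrative:

```python
# Sketch of the keyword search tool: weighted-frequency scoring (Score_kw)
# plus keyword-only snippets (Snippet).
def keyword_search(chunks, keywords: list[str], top_k: int = 5):
    scored = []
    for c in chunks:
        # Score_kw: occurrences of each keyword in the chunk text, weighted
        # by keyword length so longer (more specific) keywords rank higher.
        score = sum(c.text.count(k) * len(k) for k in keywords)
        if score > 0:
            scored.append((score, c))
    scored.sort(key=lambda x: x[0], reverse=True)
    results = []
    for score, c in scored[:top_k]:
        # Snippet: only the sentences containing at least one keyword.
        snippet = [s for s in c.sentences if any(k in s for k in keywords)]
        results.append({"chunk_id": c.chunk_id, "score": score,
                        "snippet": snippet})
    return results
```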

The semantic search tool encodes the natural language query $q$ into a vector $\mathbf{v}_q = f_{\text{emb}}(q)$ and computes cosine similarity with all sentence embeddings:

$$\mathrm{Score}_{\mathrm{sem}}(s_{i,j}, q) = \frac{\mathbf{v}_{i,j}^{\top} \mathbf{v}_q}{\|\mathbf{v}_{i,j}\| \, \|\mathbf{v}_q\|}$$

It aggregates results by parent chunk, returning the top-$k$ chunks along with their highest-scoring sentences as snippets. Finally, the chunk read tool allows the agent to access the full text of any chunk identified via prior searches, including adjacent chunks for context expansion.
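
Under the same assumptions, these two tools reduce to a dot product over the pre-normalized sentence embeddings from `build_index` plus a per-chunk max aggregation; the `expand` parameter for adjacent-chunk expansion is hypothetical:

```python
# Sketch of the semantic search and chunk read tools. `chunks`, `sents`,
# `vecs`, and `encoder` come from the build_index sketch above.
def semantic_search(encoder, chunks, sents, vecs, query: str, top_k: int = 5):
    v_q = encoder.encode([query], normalize_embeddings=True)[0]
    sims = vecs @ v_q  # cosine similarity: embeddings are pre-normalized
    best = {}  # parent chunk id -> (best sentence score, best sentence)
    for (chunk_id, sent), score in zip(sents, sims):
        if chunk_id not in best or score > best[chunk_id][0]:
            best[chunk_id] = (float(score), sent)
    ranked = sorted(best.items(), key=lambda x: x[1][0], reverse=True)[:top_k]
    return [{"chunk_id": cid, "score": sc, "snippet": sent}
            for cid, (sc, sent) in ranked]

def chunk_read(chunks, chunk_id: int, expand: int = 0):
    """Return the full text of a chunk, optionally with adjacent chunks."""
    lo = max(0, chunk_id - expand)
    hi = min(len(chunks), chunk_id + expand + 1)
    return "\n".join(chunks[i].text for i in range(lo, hi))
```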

The agent loop is intentionally kept simple to isolate the impact of the interface design. It follows a ReAct-like pattern: at each iteration, the LLM receives the message history and available tools, then decides whether to call a tool or produce a final answer. A context tracker maintains a set $C^{\text{read}}$ of previously accessed chunks; if the agent attempts to re-read a chunk, the tool returns a zero-token notification to prevent redundancy and encourage exploration. The loop terminates when an answer is produced or a maximum iteration budget is reached, at which point the agent is prompted to synthesize a response from accumulated evidence.
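
A hedged sketch of such a loop is given below, with a hypothetical `llm.chat` tool-calling interface standing in for whatever LLM API is actually used:

```python
# Sketch of the ReAct-style agent loop with a read-chunk tracker (C^read).
# The llm.chat interface and reply shape are illustrative assumptions.
def agent_loop(llm, tools, question: str, max_steps: int = 20):
    messages = [{"role": "user", "content": question}]
    chunks_read: set[int] = set()  # C^read: previously accessed chunks
    for _ in range(max_steps):
        reply = llm.chat(messages, tools=tools)  # tool call or final answer
        if reply.tool_call is None:
            return reply.content  # final answer produced
        if reply.tool_call.name == "chunk_read":
            cid = reply.tool_call.args["chunk_id"]
            if cid in chunks_read:
                # Zero-token notification instead of re-sending known text.
                observation = f"[chunk {cid} already read]"
            else:
                chunks_read.add(cid)
                observation = tools["chunk_read"](cid)
        else:
            observation = tools[reply.tool_call.name](**reply.tool_call.args)
        messages.append({"role": "tool", "content": str(observation)})
    # Budget exhausted: prompt a final synthesis from gathered evidence.
    messages.append({"role": "user",
                     "content": "Answer now using the evidence collected."})
    return llm.chat(messages).content
```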

This architecture enables true agentic behavior: the agent autonomously selects retrieval strategies, iterates based on intermediate results, and conditions each tool call on prior observations. Unlike fixed workflows or graph-based systems, A-RAG does not prescribe a rigid sequence of operations. Instead, it provides a flexible interface that allows the agent to dynamically decompose questions, verify findings, and re-plan as needed — all while minimizing context overhead through on-demand, incremental retrieval.

Experiment

  • A-RAG is validated as the only RAG paradigm satisfying all three principles of true agentic autonomy, distinguishing it from Graph RAG and Workflow RAG.
  • Across four multi-hop QA benchmarks, A-RAG (Full) consistently outperforms vanilla, graph-based, and workflow-based RAG methods, especially when paired with stronger LLMs like GPT-5-mini.
  • Ablation studies confirm that A-RAG’s hierarchical retrieval tools—semantic search, keyword search, and chunk read—are interdependent; removing any degrades performance, underscoring the value of multi-granularity and progressive information access.
  • Test-time scaling experiments show A-RAG effectively leverages increased computational budget (steps and reasoning effort), with stronger models benefiting more, positioning it as a scalable paradigm.
  • A-RAG achieves higher accuracy while retrieving fewer tokens than traditional RAG methods, demonstrating superior context efficiency enabled by its hierarchical interface design.
  • Failure analysis reveals a paradigm shift: while Naive RAG fails primarily due to retrieval limitations, A-RAG’s main bottleneck is reasoning chain errors—particularly entity confusion—indicating future work should focus on improving reasoning fidelity over retrieval coverage.

The authors evaluate A-RAG’s ablation variants and find that removing any retrieval tool—keyword search, semantic search, or chunk read—consistently degrades performance, confirming that multi-granularity access and progressive information disclosure are critical for effective multi-hop reasoning. A-RAG (Full) achieves the highest scores across most metrics, demonstrating that hierarchical tool interfaces enable models to autonomously select and refine relevant context while avoiding noise from irrelevant content.

The authors use A-RAG to evaluate context efficiency across methods, measuring retrieved tokens under a GPT-5-mini backbone. Results show A-RAG (Full) achieves higher accuracy while retrieving a comparable or smaller number of tokens than traditional RAG methods, indicating superior context utilization. The hierarchical interface design enables more selective and efficient retrieval, reducing noise from irrelevant content.

The authors use a unified evaluation framework to compare A-RAG against multiple RAG paradigms across four multi-hop QA benchmarks, finding that A-RAG consistently outperforms both vanilla and structured RAG methods, especially when paired with stronger reasoning models like GPT-5-mini. Results show that granting models autonomy to dynamically select retrieval tools leads to better performance than fixed retrieval pipelines, even with minimal tooling. The full A-RAG configuration further improves outcomes by enabling progressive, multi-granularity information access, demonstrating that hierarchical interfaces enhance both accuracy and context efficiency.

The authors use a comparative table to position their proposed A-RAG method against existing RAG approaches, highlighting its unique support for autonomy, iterative reasoning, and interleaved tool use. Results show that A-RAG is the only method satisfying all three principles of true agentic autonomy, distinguishing it from prior paradigms that lack one or more of these capabilities. This structural advantage enables A-RAG to dynamically adapt retrieval strategies, contributing to its superior performance across benchmarks.

The authors analyze failure modes in A-RAG and find that reasoning chain errors dominate, with entity confusion being the most frequent subcategory—accounting for 40% of errors on MuSiQue and 71% on 2WikiMultiHopQA. This indicates that while A-RAG successfully retrieves relevant documents, the model often struggles to correctly interpret or disambiguate entities within them. The distribution of errors also varies by dataset, suggesting task-specific challenges in question understanding and strategy selection.

