
A-RAG: Improving the Scalability of Agentic Retrieval-Augmented Generation with Hierarchical Retrieval Interfaces

Mingxuan Du, Benfeng Xu, Chiwei Zhu, Shaohan Wang, Pengyu Wang, Xiaorui Wang, Zhendong Mao

Abstract

Advanced language models exhibit strong reasoning and long-horizon tool-use capabilities, yet existing RAG (Retrieval-Augmented Generation) systems fail to fully exploit these abilities. Current methods rely mainly on two paradigms: (1) designing algorithms that retrieve passages in a single pass and concatenate them to the model's input, or (2) predefining a workflow and prompting the model to execute it step by step. Both approaches prevent the model from participating in retrieval decisions, which blocks efficient scaling as model capabilities improve. This paper proposes A-RAG, an agentic RAG framework that exposes hierarchical retrieval interfaces directly to the model. A-RAG provides three retrieval tools (keyword search, semantic search, and chunk reading) that allow the agent to adaptively search for and gather information across multiple granularities. Experimental results on multiple open-domain QA benchmarks show that A-RAG consistently outperforms existing methods while using comparable or fewer retrieved tokens, confirming that it effectively leverages model capabilities and adapts dynamically to different RAG tasks. The paper further presents a systematic study of how A-RAG scales with model size and inference-time compute. The code and evaluation sets are released to facilitate future research and are available at https://github.com/Ayanami0730/arag.

One-sentence Summary

Researchers from USTC and Metastone Technology propose A-RAG, an agentic RAG framework with hierarchical retrieval tools (keyword_search, semantic_search, chunk_read) that enable LLMs to autonomously adapt search strategies across granularities, outperforming prior methods on multi-hop QA benchmarks while using fewer tokens and scaling efficiently with model size and test-time compute.

Key Contributions

  • A-RAG addresses the limitation of static RAG systems by enabling LLMs to dynamically control retrieval through hierarchical tools—keyword_search, semantic_search, and chunk_read—allowing adaptive, multi-granularity information gathering aligned with the model’s reasoning process.
  • Evaluated on open-domain QA benchmarks, A-RAG consistently outperforms prior methods while using comparable or fewer retrieved tokens, validating that agent-driven retrieval leverages model capabilities more effectively than predefined workflows or single-shot retrieval.
  • Systematic scaling experiments show A-RAG’s performance improves with larger models and increased test-time compute, demonstrating its efficiency in scaling alongside advances in LLM capabilities and computational resources.

Introduction

The authors leverage the growing reasoning and tool-use capabilities of frontier LLMs to rethink Retrieval-Augmented Generation (RAG), which has historically relied on static, algorithm-driven retrieval or rigid, predefined workflows that limit model autonomy. Prior approaches, including Graph RAG and Workflow RAG, do not let the model adapt its retrieval strategy to context or task complexity, preventing efficient scaling as models improve. The authors' main contribution is A-RAG, an agentic framework that exposes hierarchical retrieval tools (keyword_search, semantic_search, chunk_read) directly to the model, enabling it to autonomously navigate information at multiple granularities. Experiments show A-RAG outperforms existing methods with fewer retrieved tokens and scales effectively with model size and test-time compute, indicating that agent-driven retrieval interfaces are more powerful than fixed retrieval algorithms.

Dataset

  • The authors use only publicly available benchmarks previously curated and processed by prior research, ensuring ethical compliance.
  • No new data is collected, and no human subjects are involved in this work.
  • The focus is on advancing retrieval-augmented generation (RAG) in large language models, with no added ethical risks beyond those already present in the base models.
  • Dataset composition and processing follow established practices from prior studies, without introducing novel filtering, cropping, or metadata construction steps.

Method

The authors leverage a minimalist yet powerful agent-centric architecture called A-RAG, which exposes hierarchical retrieval interfaces to enable autonomous, iterative information gathering. The framework is built around three core components: a lightweight hierarchical index, a suite of granular retrieval tools, and a simple ReAct-style agent loop that facilitates dynamic strategy selection and interleaved tool use.

The hierarchical index is constructed in two stages: chunking and embedding. The corpus is partitioned into approximately 1,000-token chunks aligned with sentence boundaries to preserve semantic coherence. Each chunk is then decomposed into constituent sentences, and each sentence is embedded using a pre-trained encoder $f_{\text{emb}}$, yielding vector representations $\mathbf{v}_{i,j} = f_{\text{emb}}(s_{i,j})$. This sentence-level embedding enables fine-grained semantic matching while preserving the mapping to parent chunks. Crucially, keyword-level retrieval is handled at query time via exact text matching, avoiding costly offline indexing. This yields a three-tiered representation: implicit keyword-level for precise entity matching, sentence-level for semantic search, and chunk-level for full content access.
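
As a concrete illustration, the two-stage indexing can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the encoder model, the whitespace token count, and the regex sentence splitter are all assumptions standing in for unspecified details.

```python
import re

import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed encoder f_emb; the paper does not name a specific model.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def split_sentences(text: str) -> list[str]:
    # Naive splitter on sentence-ending punctuation; the paper's segmenter is unspecified.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def build_index(corpus: str, chunk_tokens: int = 1000):
    """Partition the corpus into ~1,000-token chunks on sentence boundaries,
    then embed every sentence, keeping the mapping to its parent chunk."""
    chunks: list[list[str]] = []
    current, count = [], 0
    for s in split_sentences(corpus):
        n = len(s.split())  # crude whitespace token count stands in for a real tokenizer
        if current and count + n > chunk_tokens:
            chunks.append(current)
            current, count = [], 0
        current.append(s)
        count += n
    if current:
        chunks.append(current)
    # Sentence-level embeddings v_{i,j} = f_emb(s_{i,j}) with parent-chunk indices.
    sent_records = [(i, j, s) for i, ch in enumerate(chunks) for j, s in enumerate(ch)]
    vectors = encoder.encode([s for _, _, s in sent_records], normalize_embeddings=True)
    return chunks, sent_records, np.asarray(vectors)
```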

To interface with this index, the authors design three retrieval tools operating at different granularities. The keyword search tool accepts a list of keywords $\mathcal{K} = \{k_1, k_2, \ldots, k_m\}$ and returns the top-$k$ chunks ranked by a weighted frequency score:

$$\mathrm{Score}_{\mathrm{kw}}(c_i, \mathcal{K}) = \sum_{k \in \mathcal{K}} \mathrm{count}(k, T_i) \cdot |k|$$

where $T_i$ denotes the text of chunk $c_i$ and longer keywords are weighted higher for specificity. For each matched chunk, it returns an abbreviated snippet containing only the sentences that include at least one keyword:

$$\mathrm{Snippet}(c_i, \mathcal{K}) = \{\, s \in \mathrm{Sent}(c_i) \mid \exists k \in \mathcal{K},\; k \subseteq s \,\}$$
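
The two formulas translate directly into code. The sketch below assumes the `chunks` structure from the indexing sketch above and uses each chunk's concatenated text as $T_i$; function and parameter names are illustrative, not from the paper.

```python
def keyword_search(chunks: list[list[str]], keywords: list[str], top_k: int = 5):
    """Score_kw(c_i, K) = sum over k of count(k, T_i) * |k|, with keyword-bearing
    sentences returned as the snippet for each matched chunk."""
    results = []
    for i, chunk in enumerate(chunks):
        text = " ".join(chunk)  # T_i: the chunk's full text
        score = sum(text.count(k) * len(k) for k in keywords)  # longer keywords weigh more
        if score > 0:
            snippet = [s for s in chunk if any(k in s for k in keywords)]  # Snippet(c_i, K)
            results.append((i, score, snippet))
    return sorted(results, key=lambda r: r[1], reverse=True)[:top_k]
```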

The semantic search tool encodes the natural-language query $q$ into a vector $\mathbf{v}_q = f_{\text{emb}}(q)$ and computes cosine similarity against all sentence embeddings:

$$\mathrm{Score}_{\mathrm{sem}}(s_{i,j}, q) = \frac{\mathbf{v}_{i,j}^{T}\,\mathbf{v}_q}{\|\mathbf{v}_{i,j}\|\,\|\mathbf{v}_q\|}$$

It aggregates results by parent chunk, returning the top-$k$ chunks along with their highest-scoring sentences as snippets. Finally, the chunk read tool allows the agent to access the full text of any chunk identified via prior searches, including adjacent chunks for context expansion.
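
A matching sketch for the remaining two tools, reusing `encoder`, `sent_records`, and `vectors` from the indexing sketch. The max-over-sentences aggregation and the one-chunk context window are plausible readings of the description, not confirmed implementation details.

```python
def semantic_search(query: str, sent_records, vectors, top_k: int = 5):
    """Cosine similarity between the query embedding and all sentence embeddings,
    aggregated by parent chunk (keeping each chunk's best-scoring sentence)."""
    v_q = encoder.encode([query], normalize_embeddings=True)[0]
    sims = vectors @ v_q  # vectors are L2-normalized, so dot product = cosine
    best: dict[int, tuple[float, str]] = {}  # chunk id -> (score, snippet sentence)
    for (i, _, s), score in zip(sent_records, sims):
        if i not in best or score > best[i][0]:
            best[i] = (float(score), s)
    ranked = sorted(best.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(i, score, snippet) for i, (score, snippet) in ranked[:top_k]]

def chunk_read(chunks, idx: int, context: int = 1):
    """Full text of a chunk, plus adjacent chunks for context expansion."""
    lo, hi = max(0, idx - context), min(len(chunks), idx + context + 1)
    return [" ".join(chunks[i]) for i in range(lo, hi)]
```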

The agent loop is intentionally kept simple to isolate the impact of the interface design. It follows a ReAct-like pattern: at each iteration, the LLM receives the message history and available tools, then decides whether to call a tool or produce a final answer. A context tracker maintains a set $C^{\text{read}}$ of previously accessed chunks; if the agent attempts to re-read a chunk, the tool returns a zero-token notification to prevent redundancy and encourage exploration. The loop terminates when an answer is produced or a maximum iteration budget is reached, at which point the agent is prompted to synthesize a response from accumulated evidence.
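
The loop itself condenses to a few lines. The sketch below uses a hypothetical `llm.chat` client with a `tool_call` field (no real library's API is implied) and implements the context tracker $C^{\text{read}}$ and iteration budget as described.

```python
def run_agent(question: str, llm, tools: dict, max_steps: int = 10) -> str:
    """ReAct-style loop: at each step the model either calls a tool or answers."""
    messages = [{"role": "user", "content": question}]
    chunks_read: set[int] = set()  # context tracker C^read of chunks already read
    for _ in range(max_steps):
        reply = llm.chat(messages, tools=tools)  # hypothetical client API
        if reply.tool_call is None:
            return reply.content  # final answer produced, loop ends
        name, args = reply.tool_call.name, reply.tool_call.args
        if name == "chunk_read" and args["idx"] in chunks_read:
            result = "[already read]"  # minimal notification against redundant reads
        else:
            result = tools[name](**args)
            if name == "chunk_read":
                chunks_read.add(args["idx"])
        messages.append({"role": "tool", "name": name, "content": str(result)})
    # Iteration budget exhausted: prompt synthesis from accumulated evidence.
    messages.append({"role": "user", "content": "Answer now using the evidence gathered so far."})
    return llm.chat(messages).content
```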

This architecture enables true agentic behavior: the agent autonomously selects retrieval strategies, iterates based on intermediate results, and conditions each tool call on prior observations. Unlike fixed workflows or graph-based systems, A-RAG does not prescribe a rigid sequence of operations. Instead, it provides a flexible interface that allows the agent to dynamically decompose questions, verify findings, and re-plan as needed — all while minimizing context overhead through on-demand, incremental retrieval.

Experiment

  • A-RAG is validated as the only RAG paradigm satisfying all three principles of true agentic behavior (autonomy, iterative reasoning, and interleaved tool use), distinguishing it from Graph RAG and Workflow RAG.
  • Across four multi-hop QA benchmarks, A-RAG (Full) consistently outperforms vanilla, graph-based, and workflow-based RAG methods, especially when paired with stronger LLMs like GPT-5-mini.
  • Ablation studies confirm that A-RAG’s hierarchical retrieval tools—semantic search, keyword search, and chunk read—are interdependent; removing any degrades performance, underscoring the value of multi-granularity and progressive information access.
  • Test-time scaling experiments show A-RAG effectively leverages increased computational budget (steps and reasoning effort), with stronger models benefiting more, positioning it as a scalable paradigm.
  • A-RAG achieves higher accuracy while retrieving fewer tokens than traditional RAG methods, demonstrating superior context efficiency enabled by its hierarchical interface design.
  • Failure analysis reveals a paradigm shift: while Naive RAG fails primarily due to retrieval limitations, A-RAG’s main bottleneck is reasoning chain errors—particularly entity confusion—indicating future work should focus on improving reasoning fidelity over retrieval coverage.

The authors evaluate A-RAG’s ablation variants and find that removing any retrieval tool—keyword search, semantic search, or chunk read—consistently degrades performance, confirming that multi-granularity access and progressive information disclosure are critical for effective multi-hop reasoning. A-RAG (Full) achieves the highest scores across most metrics, demonstrating that hierarchical tool interfaces enable models to autonomously select and refine relevant context while avoiding noise from irrelevant content.

The authors evaluate context efficiency across methods by measuring retrieved tokens under a GPT-5-mini backbone. Results show A-RAG (Full) achieves higher accuracy while retrieving fewer or a comparable number of tokens compared with traditional RAG methods, indicating superior context utilization. The hierarchical interface design enables more selective and efficient retrieval, reducing noise from irrelevant content.

The authors use a unified evaluation framework to compare A-RAG against multiple RAG paradigms across four multi-hop QA benchmarks, finding that A-RAG consistently outperforms both vanilla and structured RAG methods, especially when paired with stronger reasoning models like GPT-5-mini. Results show that granting models autonomy to dynamically select retrieval tools leads to better performance than fixed retrieval pipelines, even with minimal tooling. The full A-RAG configuration further improves outcomes by enabling progressive, multi-granularity information access, demonstrating that hierarchical interfaces enhance both accuracy and context efficiency.

The authors use a comparative table to position their proposed A-RAG method against existing RAG approaches, highlighting its unique support for autonomy, iterative reasoning, and interleaved tool use. Results show that A-RAG is the only method satisfying all three principles of true agentic autonomy, distinguishing it from prior paradigms that lack one or more of these capabilities. This structural advantage enables A-RAG to dynamically adapt retrieval strategies, contributing to its superior performance across benchmarks.

The authors analyze failure modes in A-RAG and find that reasoning chain errors dominate, with entity confusion being the most frequent subcategory—accounting for 40% of errors on MuSiQue and 71% on 2WikiMultiHopQA. This indicates that while A-RAG successfully retrieves relevant documents, the model often struggles to correctly interpret or disambiguate entities within them. The distribution of errors also varies by dataset, suggesting task-specific challenges in question understanding and strategy selection.

