4달 전

Mingxuan Du Benfeng Xu Chiwei Zhu Shaohan Wang Pengyu Wang Xiaorui Wang Zhendong Mao

초록

전방(frontier) 언어 모델은 강력한 추론 능력과 장기적 도구 사용 능력을 보여주고 있으나, 기존의 RAG 시스템은 이러한 능력을 효과적으로 활용하지 못하고 있다. 기존 접근법은 두 가지 주된 패러다임에 의존한다: (1) 단일 샷(single-shot)으로 문장 조각을 검색하여 모델 입력에 연결하는 알고리즘을 설계하는 방식, 또는 (2) 사전에 워크플로우를 정의하고 모델에 단계별로 실행하도록 프롬프트하는 방식이다. 이러한 두 가지 패러다임 모두 모델이 검색 결정에 참여할 수 있도록 하지 못하며, 이로 인해 모델 성능 향상과 함께 효율적인 확장이 불가능하다. 본 논문에서는 모델에 계층적 검색 인터페이스를 직접 노출시키는 Agentic RAG 프레임워크인 A-RAG를 제안한다. A-RAG는 키워드 검색, 의미적 검색, 청크 읽기 세 가지 검색 도구를 제공하여 에이전트가 다양한 세분화 수준에서 정보를 적응적으로 검색하고 획득할 수 있도록 한다. 다양한 개방형 QA 벤치마크에서 수행된 실험 결과, A-RAG는 기존 방법들과 비교해 유사하거나 더 낮은 검색 토큰 수를 사용하면서도 일관되게 우수한 성능을 보이며, 모델의 능력을 효과적으로 활용하고 다양한 RAG 작업에 동적으로 대응함을 입증한다. 또한 본 연구는 A-RAG가 모델 크기와 테스트 시 계산 자원에 따라 어떻게 확장되는지 체계적으로 분석하였다. 향후 연구를 촉진하기 위해 코드와 평가 세트를 공개할 예정이며, 관련 자료는 https://github.com/Ayanami0730/arag 에서 확인할 수 있다.

One-sentence Summary

Researchers from USTC and Metastone Technology propose A-RAG, an agentic RAG framework with hierarchical retrieval tools (keyword_search, semantic_search, chunk_read) that enable LLMs to autonomously adapt search strategies across granularities, outperforming prior methods on multi-hop QA benchmarks while using fewer tokens and scaling efficiently with model size and test-time compute.

Key Contributions

A-RAG addresses the limitation of static RAG systems by enabling LLMs to dynamically control retrieval through hierarchical tools—keyword_search, semantic_search, and chunk_read—allowing adaptive, multi-granularity information gathering aligned with the model’s reasoning process.
Evaluated on open-domain QA benchmarks, A-RAG consistently outperforms prior methods while using comparable or fewer retrieved tokens, validating that agent-driven retrieval leverages model capabilities more effectively than predefined workflows or single-shot retrieval.
Systematic scaling experiments show A-RAG’s performance improves with larger models and increased test-time compute, demonstrating its efficiency in scaling alongside advances in LLM capabilities and computational resources.

Introduction

The authors leverage the growing reasoning and tool-use capabilities of frontier LLMs to rethink Retrieval-Augmented Generation (RAG), which has historically relied on static, algorithm-driven retrieval or rigid, predefined workflows that limit model autonomy. Prior approaches—including Graph RAG and Workflow RAG—fail to let the model dynamically adapt its retrieval strategy based on context or task complexity, preventing efficient scaling with model improvements. The authors’ main contribution is A-RAG, an agentic framework that exposes hierarchical retrieval tools (keyword_search, semantic_search, chunk_read) directly to the model, enabling it to autonomously navigate information at multiple granularities. Experiments show A-RAG outperforms existing methods with fewer retrieved tokens and scales effectively with model size and test-time compute, proving that agent-driven retrieval interfaces are more powerful than fixed retrieval algorithms.

Dataset

The authors use only publicly available benchmarks previously curated and processed by prior research, ensuring ethical compliance.
No new data is collected, and no human subjects are involved in this work.
The focus is on advancing retrieval-augmented generation (RAG) in large language models, with no added ethical risks beyond those already present in the base models.
Dataset composition and processing follow established practices from prior studies, without introducing novel filtering, cropping, or metadata construction steps.

Method

The authors leverage a minimalist yet powerful agent-centric architecture called A-RAG, which exposes hierarchical retrieval interfaces to enable autonomous, iterative information gathering. The framework is built around three core components: a lightweight hierarchical index, a suite of granular retrieval tools, and a simple ReAct-style agent loop that facilitates dynamic strategy selection and interleaved tool use.

The hierarchical index is constructed in two stages: chunking and embedding. The corpus is partitioned into approximately 1,000-token chunks aligned with sentence boundaries to preserve semantic coherence. Each chunk is then decomposed into constituent sentences, and each sentence is embedded using a pre-trained encoder $f_{\text{emb}}$ , yielding vector representations $\mathbf{v}_{i,j} = f_{\text{emb}}(s_{i,j})$ . This sentence-level embedding enables fine-grained semantic matching while preserving the mapping to parent chunks. Crucially, keyword-level retrieval is handled at query time via exact text matching, avoiding costly offline indexing. This yields a three-tiered representation: implicit keyword-level for precise entity matching, sentence-level for semantic search, and chunk-level for full content access.

To interface with this index, the authors design three retrieval tools operating at different granularities. The keyword search tool accepts a list of keywords $\mathcal{K} = \{k_1, k_2, \ldots, k_m\}$ and returns top- $k$ chunks ranked by a weighted frequency score:

\mathrm{Score}_{\mathrm{kw}}(c_i, \mathcal{K}) = \sum_{k \in \mathcal{K}} \mathrm{count}(k, T_i) \cdot |k|

where longer keywords are weighted higher for specificity. For each matched chunk, it returns an abbreviated snippet containing only sentences that include at least one keyword:

\mathrm{Snippet}(c_i, \mathcal{K}) = \{ s \in \mathrm{Sent}(c_i) \mid \exists k \in \mathcal{K}, k \subseteq s \}

The semantic search tool encodes the natural language query $q$ into a vector $\mathbf{v}_q = f_{\text{emb}}(q)$ and computes cosine similarity with all sentence embeddings:

\mathrm{Score}_{\mathrm{sem}}(s_{i,j}, q) = \frac{ \mathbf{v}_{i,j}^T \mathbf{v}_q }{ \| \mathbf{v}_{i,j} \| \| \mathbf{v}_q \| }

It aggregates results by parent chunk, returning the top- $k$ chunks along with their highest-scoring sentences as snippets. Finally, the chunk read tool allows the agent to access the full text of any chunk identified via prior searches, including adjacent chunks for context expansion.

The agent loop is intentionally kept simple to isolate the impact of the interface design. It follows a ReAct-like pattern: at each iteration, the LLM receives the message history and available tools, then decides whether to call a tool or produce a final answer. A context tracker maintains a set $C^{\text{read}}$ of previously accessed chunks; if the agent attempts to re-read a chunk, the tool returns a zero-token notification to prevent redundancy and encourage exploration. The loop terminates when an answer is produced or a maximum iteration budget is reached, at which point the agent is prompted to synthesize a response from accumulated evidence.

This architecture enables true agentic behavior: the agent autonomously selects retrieval strategies, iterates based on intermediate results, and conditions each tool call on prior observations. Unlike fixed workflows or graph-based systems, A-RAG does not prescribe a rigid sequence of operations. Instead, it provides a flexible interface that allows the agent to dynamically decompose questions, verify findings, and re-plan as needed — all while minimizing context overhead through on-demand, incremental retrieval.

Experiment

A-RAG is validated as the only RAG paradigm satisfying all three principles of true agentic autonomy, distinguishing it from Graph RAG and Workflow RAG.
Across four multi-hop QA benchmarks, A-RAG (Full) consistently outperforms vanilla, graph-based, and workflow-based RAG methods, especially when paired with stronger LLMs like GPT-5-mini.
Ablation studies confirm that A-RAG’s hierarchical retrieval tools—semantic search, keyword search, and chunk read—are interdependent; removing any degrades performance, underscoring the value of multi-granularity and progressive information access.
Test-time scaling experiments show A-RAG effectively leverages increased computational budget (steps and reasoning effort), with stronger models benefiting more, positioning it as a scalable paradigm.
A-RAG achieves higher accuracy while retrieving fewer tokens than traditional RAG methods, demonstrating superior context efficiency enabled by its hierarchical interface design.
Failure analysis reveals a paradigm shift: while Naive RAG fails primarily due to retrieval limitations, A-RAG’s main bottleneck is reasoning chain errors—particularly entity confusion—indicating future work should focus on improving reasoning fidelity over retrieval coverage.

The authors evaluate A-RAG’s ablation variants and find that removing any retrieval tool—keyword search, semantic search, or chunk read—consistently degrades performance, confirming that multi-granularity access and progressive information disclosure are critical for effective multi-hop reasoning. A-RAG (Full) achieves the highest scores across most metrics, demonstrating that hierarchical tool interfaces enable models to autonomously select and refine relevant context while avoiding noise from irrelevant content.

The authors use A-RAG to evaluate context efficiency across methods, measuring retrieved tokens under a GPT-5-mini backbone. Results show A-RAG (Full) achieves higher accuracy while retrieving fewer or comparable tokens than traditional RAG methods, indicating superior context utilization. The hierarchical interface design enables more selective and efficient retrieval, reducing noise from irrelevant content.

The authors use a unified evaluation framework to compare A-RAG against multiple RAG paradigms across four multi-hop QA benchmarks, finding that A-RAG consistently outperforms both vanilla and structured RAG methods, especially when paired with stronger reasoning models like GPT-5-mini. Results show that granting models autonomy to dynamically select retrieval tools leads to better performance than fixed retrieval pipelines, even with minimal tooling. The full A-RAG configuration further improves outcomes by enabling progressive, multi-granularity information access, demonstrating that hierarchical interfaces enhance both accuracy and context efficiency.

The authors use a comparative table to position their proposed A-RAG method against existing RAG approaches, highlighting its unique support for autonomy, iterative reasoning, and interleaved tool use. Results show that A-RAG is the only method satisfying all three principles of true agentic autonomy, distinguishing it from prior paradigms that lack one or more of these capabilities. This structural advantage enables A-RAG to dynamically adapt retrieval strategies, contributing to its superior performance across benchmarks.

The authors analyze failure modes in A-RAG and find that reasoning chain errors dominate, with entity confusion being the most frequent subcategory—accounting for 40% of errors on MuSiQue and 71% on 2WikiMultiHopQA. This indicates that while A-RAG successfully retrieves relevant documents, the model often struggles to correctly interpret or disambiguate entities within them. The distribution of errors also varies by dataset, suggesting task-specific challenges in question understanding and strategy selection.

소스 PDF 코드 보기

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

4달 전

Mingxuan Du Benfeng Xu Chiwei Zhu Shaohan Wang Pengyu Wang Xiaorui Wang Zhendong Mao

초록

One-sentence Summary

Key Contributions

A-RAG addresses the limitation of static RAG systems by enabling LLMs to dynamically control retrieval through hierarchical tools—keyword_search, semantic_search, and chunk_read—allowing adaptive, multi-granularity information gathering aligned with the model’s reasoning process.
Evaluated on open-domain QA benchmarks, A-RAG consistently outperforms prior methods while using comparable or fewer retrieved tokens, validating that agent-driven retrieval leverages model capabilities more effectively than predefined workflows or single-shot retrieval.
Systematic scaling experiments show A-RAG’s performance improves with larger models and increased test-time compute, demonstrating its efficiency in scaling alongside advances in LLM capabilities and computational resources.

Introduction

Dataset

The authors use only publicly available benchmarks previously curated and processed by prior research, ensuring ethical compliance.
No new data is collected, and no human subjects are involved in this work.
The focus is on advancing retrieval-augmented generation (RAG) in large language models, with no added ethical risks beyond those already present in the base models.
Dataset composition and processing follow established practices from prior studies, without introducing novel filtering, cropping, or metadata construction steps.

Method

\mathrm{Score}_{\mathrm{kw}}(c_i, \mathcal{K}) = \sum_{k \in \mathcal{K}} \mathrm{count}(k, T_i) \cdot |k|

where longer keywords are weighted higher for specificity. For each matched chunk, it returns an abbreviated snippet containing only sentences that include at least one keyword:

\mathrm{Snippet}(c_i, \mathcal{K}) = \{ s \in \mathrm{Sent}(c_i) \mid \exists k \in \mathcal{K}, k \subseteq s \}

The semantic search tool encodes the natural language query $q$ into a vector $\mathbf{v}_q = f_{\text{emb}}(q)$ and computes cosine similarity with all sentence embeddings:

\mathrm{Score}_{\mathrm{sem}}(s_{i,j}, q) = \frac{ \mathbf{v}_{i,j}^T \mathbf{v}_q }{ \| \mathbf{v}_{i,j} \| \| \mathbf{v}_q \| }

Experiment

A-RAG is validated as the only RAG paradigm satisfying all three principles of true agentic autonomy, distinguishing it from Graph RAG and Workflow RAG.
Across four multi-hop QA benchmarks, A-RAG (Full) consistently outperforms vanilla, graph-based, and workflow-based RAG methods, especially when paired with stronger LLMs like GPT-5-mini.
Ablation studies confirm that A-RAG’s hierarchical retrieval tools—semantic search, keyword search, and chunk read—are interdependent; removing any degrades performance, underscoring the value of multi-granularity and progressive information access.
Test-time scaling experiments show A-RAG effectively leverages increased computational budget (steps and reasoning effort), with stronger models benefiting more, positioning it as a scalable paradigm.
A-RAG achieves higher accuracy while retrieving fewer tokens than traditional RAG methods, demonstrating superior context efficiency enabled by its hierarchical interface design.
Failure analysis reveals a paradigm shift: while Naive RAG fails primarily due to retrieval limitations, A-RAG’s main bottleneck is reasoning chain errors—particularly entity confusion—indicating future work should focus on improving reasoning fidelity over retrieval coverage.

소스 PDF 코드 보기

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

Command Palette

A-RAG: 계층적 검색 인터페이스를 통한 에이전트 기반 검색 증강 생성의 확장성 향상

Mingxuan Du Benfeng Xu Chiwei Zhu Shaohan Wang Pengyu Wang Xiaorui Wang Zhendong Mao

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

A-RAG: 계층적 검색 인터페이스를 통한 에이전트 기반 검색 증강 생성의 확장성 향상

Mingxuan Du Benfeng Xu Chiwei Zhu Shaohan Wang Pengyu Wang Xiaorui Wang Zhendong Mao

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

A-RAG: 계층적 검색 인터페이스를 통한 에이전트 기반 검색 증강 생성의 확장성 향상

Mingxuan Du Benfeng Xu Chiwei Zhu Shaohan Wang Pengyu Wang Xiaorui Wang Zhendong Mao

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters