Learning to Retrieve from Agent Trajectories

One month ago

Yuqi Zhou Sunhao Dai Changle Qu Liang Pang Jun Xu Ji-Rong Wen

Abstract

Information retrieval (IR) systems have traditionally been designed and trained for human users, and learning-to-rank methods have relied heavily on large-scale human interaction logs such as clicks and dwell time. With the rapid emergence of LLM-based search agents, however, retrieval is increasingly consumed by agents rather than humans, and it is becoming embedded as a core component of multi-turn reasoning and action loops. In this setting, retrieval models trained under human-centric assumptions are fundamentally mismatched with the way agents issue queries and consume results.

In this work, we argue that retrieval models for agentic search should be trained directly on agent interaction data. We introduce learning to retrieve from agent trajectories as a new training paradigm, in which supervision is derived from multi-step agent interactions. Through a systematic analysis of search agent trajectories, we identify key behavioral signals that reveal document utility, including browsing actions, unbrowsed rejections, and post-browse reasoning traces.

Building on these insights, we propose LRAT, a simple yet effective framework that mines high-quality retrieval supervision from agent trajectories and incorporates relevance intensity through weighted optimization. Extensive experiments on both in-domain and out-of-domain deep research benchmarks show that retrievers trained with LRAT consistently improve evidence recall, end-to-end task success, and execution efficiency across diverse agent architectures and scales. Our findings highlight agent trajectories as a practical and scalable source of supervision and point to a promising direction for retrieval research in the era of agentic search.

One-sentence Summary

To address the misalignment between human-centric retrieval models and agentic search, researchers from Renmin University of China and the Chinese Academy of Sciences propose LRAT, a framework that optimizes retrievers by mining high-quality supervision from agent trajectories—specifically through browsing actions, unbrowsed rejections, and post-browse reasoning traces—to enhance evidence recall, task success, and execution efficiency across diverse agent architectures and scales.

Key Contributions

This work formalizes a new training paradigm called learning to retrieve from agent trajectories, which derives supervision directly from multi-step agent interaction data rather than human-centric logs.
The paper introduces LRAT, a framework that converts agent trajectories into high-quality retrieval supervision by mining behavioral signals such as browsing actions, unbrowsed rejections, and post-browse reasoning traces.
Extensive experiments on in-domain and out-of-domain deep research benchmarks demonstrate that LRAT improves evidence recall, end-to-end task success, and execution efficiency across various agent architectures and scales.

Introduction

As agents powered by large language models (LLMs) increasingly perform complex, multi-turn reasoning tasks, retrieval has shifted from a standalone service for humans to a core component of autonomous agent loops. Traditional retrieval models are trained on human-centric data, such as clicks and dwell time. This creates a fundamental mismatch, because agent queries are driven by intermediate reasoning objectives rather than immediate informational needs. To bridge this gap, the authors propose a new training paradigm, learning to retrieve from agent trajectories, together with LRAT, a framework that implements it. LRAT uses multi-step agent interaction data to mine high-quality supervision signals, such as browsing actions, unbrowsed rejections, and post-browse reasoning traces. Retrievers trained this way are directly aligned with agent behavior, which improves evidence recall and task success across various architectures.

Dataset

The authors construct a specialized dataset of agent trajectories designed to model sustained search and browsing behavior through the following process:

Dataset Composition and Sources: The authors utilize InfoSeekQA as the foundational seed dataset. This benchmark contains over 50,000 question-answer pairs that require hierarchical reasoning and iterative information acquisition. For the underlying knowledge base, they use the Wiki-25-Dump corpus, which consists of more than 11.2 million document chunks.
Subset Details and Filtering: From the initial InfoSeekQA pool, the authors select a seed set of the top 10,000 queries that feature verified ground-truth answers. To ensure data quality, they filter out any generated trajectories that exceed the maximum step limit or result in incorrect final answers. Answer correctness is validated by comparing agent outputs against the ground truth using the Qwen3-30B-A3B-Thinking-2507 model.
Data Processing and Trajectory Generation: The authors generate trajectories by executing a search agent on each seed query within a simulated environment. During execution, the agent produces reasoning traces and performs [Search] or [Browse] actions. For the corpus, document chunks are truncated to a fixed length of 512 tokens.
Model Usage: The resulting trajectories, which capture deep search processes and long interaction sequences, serve as the primary data for the authors' analysis.
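
The filtering rule described above (drop trajectories that exceed the step limit or end in a wrong answer) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `GeneratedTrajectory` record, the `max_steps` value, and the `exact_match` checker are all hypothetical. In particular, the source validates answers with an LLM judge (Qwen3-30B-A3B-Thinking-2507), for which a simple string comparison stands in here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GeneratedTrajectory:
    """Hypothetical record for one agent run on a seed query."""
    question: str
    ground_truth: str
    final_answer: str
    num_steps: int


def filter_trajectories(
    trajectories: List[GeneratedTrajectory],
    max_steps: int,
    is_correct: Callable[[str, str], bool],
) -> List[GeneratedTrajectory]:
    """Keep runs that stay within the step limit and answer correctly."""
    kept = []
    for traj in trajectories:
        if traj.num_steps > max_steps:
            continue  # exceeded the maximum step limit
        if not is_correct(traj.final_answer, traj.ground_truth):
            continue  # final answer judged incorrect
        kept.append(traj)
    return kept


def exact_match(prediction: str, gold: str) -> bool:
    """Stand-in for the LLM-based correctness judge (an assumption)."""
    return prediction.strip().lower() == gold.strip().lower()


runs = [
    GeneratedTrajectory("q1", "Paris", "paris", num_steps=5),
    GeneratedTrajectory("q2", "Paris", "London", num_steps=5),
    GeneratedTrajectory("q3", "Paris", "Paris", num_steps=50),
]
# max_steps=10 is an illustrative value; the source does not state the limit.
kept = filter_trajectories(runs, max_steps=10, is_correct=exact_match)
print([t.question for t in kept])
```

Passing the correctness check as a callable keeps the filter independent of how answers are judged, so an LLM-based judge could be swapped in without changing the loop.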

Method

The authors propose LRAT, a framework for training retrieval models directly from deep research agent trajectories. The approach is designed to capture how agents actually interact with external information systems and to turn those interactions into high-quality, utility-aware training signals. The framework operates in three stages: relevance signal mining, reasoning-aware positive filtering, and intensity-aware training.

The process begins with the generation of deep research agent trajectories. As shown in the figure below, an agent starts with an initial user query $q$ and proceeds through a series of iterative reasoning and action steps. At each turn $t$, the agent produces a reasoning state $r_t$, which guides its action $a_t$. The agent can perform either a [Search] action, generating an intermediate query $q_t$ and receiving a ranked list of candidate documents $\mathcal{D}_t$, or a [Browse] action, which retrieves the full content of a previously identified document $d_t$. The agent's reasoning state is updated with the observed information $o_t$ from the retrieval system. This cycle continues until the agent determines that sufficient information has been gathered to generate a final answer $y$. The trajectory captures the agent's decision-making process, including its information needs, retrieval choices, and information consumption.
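
One way to picture the trajectory described above is as a small record type. The sketch below is illustrative only: the class and field names (`Turn`, `Trajectory`, `count`) are assumptions chosen to mirror the symbols $q$, $r_t$, $a_t$, $q_t$, $\mathcal{D}_t$, $d_t$, $o_t$, and $y$, not a format defined by the authors.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    """One turn t of the agent loop (field names are hypothetical)."""
    reasoning: str                    # r_t: reasoning state before acting
    action: str                       # a_t: "search" or "browse"
    query: Optional[str] = None       # q_t: set for [Search] actions
    results: List[str] = field(default_factory=list)  # D_t: ranked doc ids
    doc_id: Optional[str] = None      # d_t: set for [Browse] actions
    observation: str = ""             # o_t: what the environment returned


@dataclass
class Trajectory:
    question: str                     # q: the initial user query
    turns: List[Turn] = field(default_factory=list)
    answer: Optional[str] = None      # y: the final answer

    def count(self, action: str) -> int:
        """Number of turns that used the given action type."""
        return sum(1 for turn in self.turns if turn.action == action)


traj = Trajectory(question="Where was the founder of X born?")
traj.turns.append(Turn(
    reasoning="First find who founded X.",
    action="search",
    query="founder of X",
    results=["d1", "d2", "d3"],
    observation="snippets for d1, d2, d3",
))
traj.turns.append(Turn(
    reasoning="d2 looks like a company history page.",
    action="browse",
    doc_id="d2",
    observation="full text of d2",
))
traj.answer = "City Z"
print(traj.count("search"), traj.count("browse"))
```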

The first stage of the LRAT framework is relevance signal mining. The authors start by constructing coarse supervision from the agent's [Search] → [Browse] transitions. For a search turn $t$, if the agent subsequently browses a document $d_{t+1}$ at turn $t+1$, that document is considered a naive positive sample. All other documents in the same retrieved set that are not browsed are treated as naive negatives, forming a training instance $(q_t, d_{t+1}, \mathcal{N}_t)$.

The second stage is reasoning-aware positive filtering. Browsing actions are imperfect indicators of relevance, because agents may browse documents that ultimately prove unhelpful. To refine the naive positives, the authors use a large language model (LLM) as a judge to analyze the agent's reasoning trace $r_{t+2}$ immediately following the browsing action. The LLM determines whether the reasoning explicitly uses the content of the browsed document to make progress on the task. This filters out noisy, browsed-but-unhelpful documents while preserving high-quality positive examples.
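
The mining and filtering steps above can be sketched in a few lines. This is a hedged illustration rather than the authors' code: the `Turn` and `Instance` types are hypothetical, the check that the browsed document appears in the preceding result list is an added assumption, and the `judge` callable is a placeholder for the LLM judge (here a trivial substring test so that the example runs).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Turn:
    reasoning: str                       # r_t
    action: str                          # "search" or "browse"
    query: Optional[str] = None          # q_t for searches
    results: Optional[List[str]] = None  # D_t for searches
    doc_id: Optional[str] = None         # d_t for browses


@dataclass
class Instance:
    query: str                     # q_t
    positive: str                  # d_{t+1}
    negatives: List[str]           # N_t: retrieved but not browsed
    post_reasoning: Optional[str]  # r_{t+2}, if the trajectory continues


def mine_naive_instances(turns: List[Turn]) -> List[Instance]:
    """Turn each [Search] -> [Browse] transition into (q_t, d_{t+1}, N_t)."""
    instances = []
    for t in range(len(turns) - 1):
        cur, nxt = turns[t], turns[t + 1]
        if cur.action != "search" or nxt.action != "browse":
            continue
        results = cur.results or []
        if nxt.doc_id not in results:
            continue  # assumption: only pair a browse with its own result list
        negatives = [d for d in results if d != nxt.doc_id]
        post = turns[t + 2].reasoning if t + 2 < len(turns) else None
        instances.append(Instance(cur.query or "", nxt.doc_id, negatives, post))
    return instances


def filter_positives(
    instances: List[Instance],
    judge: Callable[[str, str], bool],
) -> List[Instance]:
    """Keep instances whose post-browse reasoning is judged to use the doc."""
    return [
        x for x in instances
        if x.post_reasoning is not None and judge(x.positive, x.post_reasoning)
    ]


turns = [
    Turn("Need the founder of X.", "search", query="founder of X",
         results=["d1", "d2", "d3"]),
    Turn("d2 looks promising.", "browse", doc_id="d2"),
    Turn("d2 says the founder is Y; look up Y.", "search",
         query="Y birthplace", results=["d4", "d5"]),
    Turn("Try the first hit.", "browse", doc_id="d4"),
    Turn("Nothing useful there; search again.", "search",
         query="Y biography", results=[]),
]

naive = mine_naive_instances(turns)
# Placeholder judge: a real setup would prompt an LLM with the document
# content and the reasoning trace instead of matching on the document id.
kept = filter_positives(naive, judge=lambda doc, reasoning: doc in reasoning)
print(len(naive), len(kept))
```

In this toy run the second browse (`d4`) is mined as a naive positive but dropped by the filter, because the reasoning that follows it does not draw on the document.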

The final stage of the framework is intensity-aware training. The authors recognize that agent trajectories not only indicate relevance but also reveal the intensity of relevance. They propose an estimation scheme based on the length of the agent's post-browse reasoning trace. The analysis shows that longer reasoning chains following a browsing action are strongly correlated with higher document usefulness, analogous to human dwell time in search. To model this, they use an exponential saturation function to map the reasoning length $l$ to a bounded utility score. The relevance intensity weight $w$ is computed as $w = \frac{1}{\mu_{\text{raw}}} \left(1 - \exp\left(-\frac{\ln 2 \cdot l}{\beta}\right)\right)$, where $\beta$ is the median reasoning length across all trajectories and $\mu_{\text{raw}}$ is the global mean of the unnormalized scores. This weight is then used in a weighted contrastive learning objective, where the loss function penalizes the model more heavily for incorrectly ranking documents with high relevance intensity. The overall process involves updating the retriever model iteratively based on these intensity-weighted signals, resulting in a retrieval system that is better aligned with the actual utility of documents in the context of complex information-seeking tasks.
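
The weight formula above can be checked numerically. The sketch below computes $w$ for a few reasoning lengths and plugs the weights into a weighted contrastive loss. The weight computation follows the formula as stated; the loss is an assumption, since the source does not spell out the objective: an InfoNCE-style loss with a hypothetical `temperature` parameter, scaled per instance by $w$. Note that the $\ln 2$ factor makes the unnormalized score exactly 0.5 when $l = \beta$. Lengths are assumed to be positive so that the median is nonzero.

```python
import math
from statistics import mean, median
from typing import List, Sequence


def intensity_weights(lengths: Sequence[float]) -> List[float]:
    """w = (1 - exp(-ln2 * l / beta)) / mu_raw, beta = median length."""
    beta = median(lengths)
    raw = [1.0 - math.exp(-math.log(2) * l / beta) for l in lengths]
    mu_raw = mean(raw)  # global mean of the unnormalized scores
    return [r / mu_raw for r in raw]


def weighted_info_nce(
    weights: Sequence[float],
    pos_scores: Sequence[float],
    neg_scores: Sequence[Sequence[float]],
    temperature: float = 0.05,  # illustrative value, not from the source
) -> float:
    """Assumed objective: per-instance InfoNCE loss scaled by its weight."""
    total = 0.0
    for w, s_pos, s_negs in zip(weights, pos_scores, neg_scores):
        logits = [s_pos / temperature] + [s / temperature for s in s_negs]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += w * (log_z - logits[0])  # -log softmax of the positive
    return total / len(weights)


lengths = [100, 200, 400]  # post-browse reasoning lengths (made-up numbers)
weights = intensity_weights(lengths)
print([round(w, 3) for w in weights])

good = weighted_info_nce(weights, [0.9, 0.9, 0.9], [[0.1, 0.1]] * 3)
bad = weighted_info_nce(weights, [0.1, 0.1, 0.1], [[0.9, 0.9]] * 3)
print(good < bad)
```

Dividing by $\mu_{\text{raw}}$ keeps the average weight at 1, so the weighting redistributes emphasis across instances without changing the overall scale of the loss.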

Experiment

The researchers evaluate the LRAT framework by analyzing deep research agent trajectories to determine how browsing behavior and post-browse reasoning indicate document utility. By training retrievers using these trajectory-derived signals, the study validates that browsing is a necessary condition for task success and that reasoning length serves as a reliable proxy for relevance. Experimental results across diverse agent architectures and benchmarks demonstrate that this approach consistently improves evidence recall, increases task success rates, and enhances execution efficiency by reducing unnecessary interaction steps.

The authors evaluate LRAT on multiple agent backbones, training retrievers on both correct and incorrect trajectories. Success rates improve for all evaluated agents in both settings. Training on incorrect trajectories still yields significant gains, which suggests that failed interactions also carry useful supervision. The improvements are consistent across agent scales, indicating that the approach is broadly effective.

The authors compare the success rates of different large language models under various retrieval methods. The proposed method improves success rates for all evaluated models, with the largest gains observed for MiniMax-M2.1 and GLM-4.7. The authors attribute the improvements to better retrieval quality and more efficient agent execution.

The authors evaluate LRAT across multiple search agents and retrievers on both in-domain and out-of-domain benchmarks. LRAT consistently improves success rate and evidence recall across all agent and retriever configurations. It also reduces the average number of steps required for task completion, indicating more efficient agent execution. The gains hold across diverse architectures, including both task-optimized and generalist agents.

The graphs show performance trends as the number of interaction loops increases. Agent success rate and retriever recall both improve consistently across loop steps, suggesting that additional interaction contributes to better evidence retrieval and better outcomes.

Agent and retriever performance over steps

The authors compare the success rates of various agents using different retrievers on in-domain and out-of-domain benchmarks. LRAT consistently improves success rates across all agents on both benchmarks, reduces the average number of interaction steps required to complete tasks, and improves evidence retrieval quality, leading to better end-to-end performance.

The authors evaluate the LRAT framework across various agent backbones, retriever configurations, and task domains to validate its effectiveness in enhancing task success and retrieval quality. The results demonstrate that leveraging both correct and incorrect interaction trajectories consistently improves performance and reduces the number of steps required for task completion. Ultimately, the framework shows broad applicability across different agent scales and architectures, providing significant gains in both in-domain and out-of-domain benchmarks.
