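REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents

Abstract

Large language models are shifting from general-purpose knowledge engines to practical problem solvers, yet optimizing them for deep search tasks remains challenging. The central bottleneck is the extreme sparsity of high-quality search trajectories and reward signals. This sparsity stems from the difficulty of constructing long-horizon tasks at scale and from the high cost of interaction-heavy rollouts that involve external tool calls. To address these problems, we propose REDSearcher, a unified framework for scalable search-agent optimization that co-designs complex task synthesis, mid-training, and post-training. Specifically, REDSearcher introduces the following improvements: (1) It frames task synthesis as a dual-constrained optimization problem in which task difficulty is precisely controlled by graph structure and evidence distribution, allowing complex, high-quality tasks to be generated at scale. (2) It introduces tool-augmented queries that encourage proactive tool use rather than passive recall. (3) During mid-training, it strengthens the core atomic capabilities of knowledge processing, planning, and function calling, which substantially reduces the cost of collecting high-quality trajectories for post-training. (4) It builds a local simulated environment that enables fast, low-cost algorithmic iteration for reinforcement learning experiments. On both text-only and multimodal search-agent benchmarks, the method achieves state-of-the-art performance. To facilitate future research on long-horizon search agents, we will release 10K high-quality complex text search trajectories, 5K multimodal trajectories, and a set of 1K text RL queries, together with code and model checkpoints.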

One-sentence Summary

The REDSearcher team proposes a unified framework for optimizing search agents by co-designing task synthesis, mid-training, and post-training, using graph-constrained task generation and tool-augmented queries to reduce reliance on costly real-world rollouts, achieving SOTA across text and multimodal benchmarks while releasing 15K trajectories, 1K RL queries, and code.

Key Contributions

  • REDSearcher addresses the scarcity of high-quality search trajectories by synthesizing complex tasks via dual constraints—graph treewidth for logical complexity and evidence dispersion—enabling scalable generation of long-horizon reasoning problems that demand iterative planning and cross-document synthesis.
  • It introduces tool-augmented queries and mid-training reinforcement of core capabilities (knowledge, planning, function calling) to promote proactive tool use and reduce the cost of collecting high-quality trajectories, while a local simulated environment enables rapid, low-cost RL experimentation.
  • Evaluated on text-only and multimodal benchmarks, REDSearcher achieves state-of-the-art performance. The authors plan to release 10K text search trajectories, 5K multimodal search trajectories, and 1K RL queries to support future research on deep search agents.

Introduction

The authors apply large language models to long-horizon search tasks, in which agents must plan, retrieve, and synthesize information across multiple steps and sources. They note that prior work struggles with sparse high-quality training data and prohibitive costs from live tool interactions. Existing datasets often lack structural complexity and rely on simplistic, linear reasoning, while real-world search demands handling cyclic or fully coupled constraints that require maintaining entangled hypotheses. REDSearcher addresses this by co-designing task synthesis, mid-training, and reinforcement learning: it generates complex tasks using treewidth-guided graph topology and evidence dispersion, injects tool-augmented queries to promote proactive tool use, strengthens core subskills early to reduce rollout costs, and deploys a simulated environment for rapid RL iteration. The result is a scalable, cost-efficient framework that achieves state-of-the-art performance on both text and multimodal search benchmarks, backed by planned public releases of 15K high-quality trajectories, 1K RL queries, and training artifacts.

Dataset

  • The authors construct a synthetic dataset to train deep search agents capable of handling multi-hop, ambiguous, and non-linear queries—tasks that demand iterative tool use and evidence synthesis, which existing open-source datasets lack.

  • The dataset is generated via a scalable, controllable synthesis pipeline that combines signals from local knowledge bases and cached webpages, intentionally increasing difficulty through fuzzing and complex constraints.

  • To ensure quality and challenge, the authors apply a five-stage verifier pipeline:

    1. LLM solver pre-filter: removes instances solvable without tools.
    2. Retrievability check: filters out questions whose answers don’t appear in top-50 search snippets.
    3. Hallucination/inconsistency check: uses an LLM verifier to detect contradictions between evidence and question-answer pairs.
    4. Agent rollout verification: runs strong tool-using agents across multiple rollouts; keeps instances where at least one rollout succeeds and records pass rate as confidence.
    5. Answer uniqueness check: discards instances with plausible alternative answers to reduce ambiguity.
  • A quality study confirms 85%+ of 500 human-verified instances are logically consistent and grounded; a strong model (DeepSeek-V3.2) achieves ~40% accuracy, while humans solve 47% within 30 minutes—validating the dataset’s realistic difficulty.

  • For training, the authors generate multi-turn tool-calling data that simulates ReAct loops, using LLMs to create the tool sets, queries, and environmental feedback, which avoids costly real API calls.

  • Long-horizon interaction trajectories (up to 128K context) are synthesized using a local simulated web environment built from Wikipedia and web crawl dumps, ensuring solvability and enabling training on complex, multi-step search tasks.

  • The dataset includes highly intricate, real-world-inspired questions requiring cross-domain reasoning, such as identifying record pressing plants, healthcare facilities, racing events, and historical sites—all grounded in synthesized but plausible evidence.

  • No cropping is applied; metadata is constructed implicitly through the synthesis pipeline, embedding grounding signals (e.g., KB triples, cached passages) and confidence metrics (e.g., agent pass rates) for each instance.
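The five-stage verifier pipeline listed above can be sketched as a chain of filters. This is a minimal illustration, not the authors' implementation: the `Instance` type, the callable parameters, the exact-match comparison, and the default of four rollouts are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Instance:
    question: str
    answer: str
    pass_rate: Optional[float] = None  # filled in by the rollout stage


def _match(pred: str, gold: str) -> bool:
    # Hypothetical answer comparison: case-insensitive exact match.
    return pred.strip().lower() == gold.strip().lower()


def verify(
    inst: Instance,
    solve_without_tools: Callable[[str], str],      # hypothetical LLM solver
    top_snippets: Callable[[str], List[str]],       # hypothetical search call
    is_consistent: Callable[[Instance], bool],      # hypothetical LLM verifier
    agent_rollout: Callable[[str], str],            # hypothetical tool-using agent
    has_alternative_answer: Callable[[Instance], bool],
    n_rollouts: int = 4,                            # assumed rollout count
) -> bool:
    """Return True if the instance survives all five stages."""
    # 1. Pre-filter: drop questions an LLM answers without tools.
    if _match(solve_without_tools(inst.question), inst.answer):
        return False
    # 2. Retrievability: the answer must appear in the top-50 snippets.
    snippets = top_snippets(inst.question)[:50]
    if not any(inst.answer.lower() in s.lower() for s in snippets):
        return False
    # 3. Hallucination / inconsistency check.
    if not is_consistent(inst):
        return False
    # 4. Agent rollouts: keep if at least one succeeds; record the pass rate.
    wins = sum(_match(agent_rollout(inst.question), inst.answer)
               for _ in range(n_rollouts))
    if wins == 0:
        return False
    inst.pass_rate = wins / n_rollouts
    # 5. Uniqueness: discard if a plausible alternative answer exists.
    return not has_alternative_answer(inst)
```

Ordering the cheap LLM checks before the agent rollouts means the expensive stage only runs on instances that have already passed the inexpensive filters.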

Method

The authors leverage a structured, multi-phase training framework to develop REDSearcher, a tool-augmented agent capable of deep, long-horizon search across text and multimodal domains. The architecture is built upon a scalable task synthesis pipeline, a two-stage mid-training regimen, and a post-training phase that combines supervised fine-tuning with reinforcement learning. Each component is designed to address the sparsity of supervision and the computational cost of real-world interaction.

The core of the method begins with the scalable task synthesis pipeline, which generates complex, verifiable QA pairs by constructing reasoning graphs with controlled structural and distributional complexity. As shown in the figure below, the pipeline initiates with a seed entity set drawn from Wikipedia, from which a directed acyclic graph (DAG) is built using both structured Wikidata relations and web-based hyperlink traversal. This graph is then enriched by an LLM-driven agent to introduce cycles and interlocking constraints, increasing the treewidth and forcing the solver to maintain multiple hypotheses. Subgraph sampling extracts multiple reasoning contexts from each master graph, and an LLM generates natural-language questions anchored to these topologies. A critical innovation is the tool-enforced query evolution: static entities are replaced with operational constraints (e.g., routing queries or citation-based lookups) that require external tool invocation, ensuring that successful completion is contingent on tool use.
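To make the role of treewidth concrete, the sketch below computes an upper bound on the treewidth of a reasoning graph using the standard min-degree elimination heuristic. It is an illustration only: the source does not say how treewidth is measured, the graph is treated as undirected, and the function name is hypothetical.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple


def treewidth_upper_bound(edges: Iterable[Tuple[Hashable, Hashable]]) -> int:
    """Upper-bound treewidth with the min-degree elimination heuristic."""
    adj: Dict[Hashable, Set[Hashable]] = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    width = 0
    while adj:
        # Eliminate the vertex with the fewest remaining neighbours.
        v = min(adj, key=lambda n: len(adj[n]))
        nbrs = adj.pop(v)
        width = max(width, len(nbrs))
        # Connect the neighbours of v to each other (fill-in edges).
        for a in nbrs:
            adj[a].discard(v)
            adj[a].update(nbrs - {a})
    return width


chain = [("a", "b"), ("b", "c"), ("c", "d")]   # linear multi-hop question
cycle = chain + [("d", "a")]                    # one added cyclic constraint
coupled = cycle + [("a", "c"), ("b", "d")]      # fully coupled (K4)
```

Under this heuristic the chain has width 1, the cycle width 2, and the fully coupled graph width 3, which matches the intuition in the text that adding cycles and interlocking constraints forces a solver to hold more hypotheses at once.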

To ensure quality and difficulty, a verifier pipeline filters out instances that are trivially solvable, ungrounded, or unretrievable. LLM-based checks cover tool-free solvability, hallucinations, and API retrievability, while an agent solver runs n rollouts to validate answer consistency. Only QA pairs that survive this multi-stage filtering are retained for training. For multimodal tasks, the pipeline injects visual constraints by anchoring intermediate nodes to images and enforcing cross-modal dependencies, ensuring that visual understanding is necessary for task completion.

The training process is divided into two major phases: mid-training and post-training. As illustrated in the framework diagram, the model begins from an open-source LLM checkpoint and progresses through atomic capability acquisition (Stage 1, 32K context) and composite capability development (Stage 2, 128K context) during mid-training. This is followed by agentic supervised fine-tuning (SFT) and reinforcement learning (RL) in the post-training phase.

Mid-training comprises two stages. Stage 1 focuses on internal cognitive optimization: intent-anchored grounding, which teaches the model to extract relevant facts from noisy web pages under specific query intents, and hierarchical planning, which enables decomposition of ambiguous goals into concrete subtasks. Stage 2 introduces external environmental interaction, where the model learns to execute tool calls and maintain state across long-horizon trajectories. This staged approach allows the model to warm-start with foundational skills before engaging in costly real-world rollouts.

In post-training, the model undergoes supervised fine-tuning on high-quality ReAct-style trajectories generated in real-world environments using five tool interfaces: search, visit, Python interpreter, Google Scholar, and Google Maps. The SFT objective masks environment observations to prevent gradient contamination. Subsequently, agentic reinforcement learning is applied using GRPO with verifiable rewards. The reward is binary (0/1) based on answer correctness, and advantages are normalized within groups of rollouts per question to stabilize training. To accelerate experimentation, a functionally equivalent simulation environment is used during RL, which mimics real APIs while ensuring evidence completeness and injecting realistic noise. The simulation environment is built from cached web data and includes URL obfuscation to prevent model bias. Asynchronous rollouts and a two-tier load balancing strategy are employed to handle the computational demands of long trajectories.
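Two of the mechanisms in this paragraph are small enough to sketch: group-normalized advantages over binary rewards, and a loss mask that excludes environment observations during SFT. Both functions are illustrative assumptions rather than the authors' code. In particular, the use of the population standard deviation, the `eps` value, the role labels, and the choice to train only on assistant tokens are not specified in the source.

```python
from statistics import mean, pstdev
from typing import List


def group_normalized_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same question.

    Each reward is 1.0 for a correct final answer and 0.0 otherwise.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def sft_loss_mask(roles: List[str]) -> List[int]:
    """Per-token mask: 1 where the loss is applied, 0 where it is skipped.

    `roles` holds one hypothetical label per token. Only assistant tokens
    contribute to the loss, so tool observations never receive gradient.
    """
    return [1 if role == "assistant" else 0 for role in roles]


# Four rollouts for one question, two of them correct.
advantages = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct rollouts receive positive advantages and incorrect ones negative. When every rollout in a group gets the same reward, all advantages are zero and the group contributes no learning signal, which is one reason a query set with intermediate pass rates is useful for RL.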

The entire framework is designed to scale efficiently: task synthesis reuses graphs to amortize LLM costs, mid-training avoids real-world interaction until necessary, and RL leverages a curated, agent-verified query set to ensure clean learning signals. The result is a deep-search agent that can iteratively acquire evidence, maintain hypotheses, and synthesize information across multiple sources and modalities.
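The local simulated environment described above, built from cached web data with URL obfuscation, might look roughly like the toy sketch below. Everything here is hypothetical: the `SimulatedWeb` class, the term-overlap ranking, the hash-based obfuscation scheme, and the `sim.invalid` host are stand-ins chosen for illustration, and the sketch omits the noise injection the text mentions.

```python
import hashlib
from typing import Dict, List


class SimulatedWeb:
    """Toy offline stand-in for `search` and `visit` tools over cached pages."""

    def __init__(self, pages: Dict[str, str]):
        # Store pages under obfuscated URLs so the original domain
        # cannot act as a shortcut signal for the model.
        self._pages = {self._obfuscate(url): text for url, text in pages.items()}

    @staticmethod
    def _obfuscate(url: str) -> str:
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
        return f"https://sim.invalid/{digest}"

    def search(self, query: str, top_k: int = 5) -> List[Dict[str, str]]:
        # Rank cached pages by the number of query terms they contain.
        terms = set(query.lower().split())
        scored = []
        for url, text in self._pages.items():
            overlap = len(terms & set(text.lower().split()))
            if overlap:
                scored.append((overlap, url, text))
        scored.sort(key=lambda t: (-t[0], t[1]))
        return [{"url": u, "snippet": t[:80]} for _, u, t in scored[:top_k]]

    def visit(self, url: str) -> str:
        return self._pages.get(url, "ERROR: page not found")


web = SimulatedWeb({
    "https://en.wikipedia.org/wiki/Vinyl": "vinyl record pressing plant history",
    "https://example.com/race": "annual racing event results",
})
hits = web.search("record pressing")
```

Because every call is served from a local cache, rollouts are deterministic and cost nothing per request, which is the property that makes rapid RL iteration feasible.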

Experiment

  • REDSearcher sets a new state-of-the-art among open-source 30B-parameter agents, outperforming both open and proprietary models on complex benchmarks like GAIA, demonstrating superior parameter efficiency and deep research capability.
  • Mid-training stages progressively enhance performance: Stage I improves grounding and planning, especially on GAIA; Stage II enables robust tool use and long-horizon execution, significantly boosting performance on BrowseComp-ZH.
  • Reinforcement learning further refines capabilities, improving overall scores and reducing tool usage by 10.4% without sacrificing accuracy, indicating more efficient and strategic search behavior.
  • Tool-use analysis reveals REDSearcher relies minimally on parametric knowledge, excelling only when tools are enabled—highlighting strong planning, evidence synthesis, and iterative reasoning over memorization.
  • Multimodal experiments show strong performance across vision-language benchmarks, outperforming large proprietary models and a Qwen3-VL baseline, with capabilities transferring well to text-only tasks.
  • Analysis of tool usage patterns shows adaptive behavior: simpler tasks require fewer turns, while complex ones involve more decomposition, reflection, and verification; RL training reduces unnecessary search steps, especially on easier benchmarks.

The authors use a staged mid-training approach to progressively enhance the model’s agentic capabilities, with each stage building upon the last to improve performance across multiple benchmarks. Results show consistent gains in average scores as the model advances through grounding, planning, and agentic interaction phases, particularly on complex tasks like GAIA and BrowseComp-ZH. This structured training strategy effectively bridges the gap between understanding and action, enabling more robust and goal-consistent behavior in deep search scenarios.

The authors evaluate their multimodal search agent, REDSearcher-MM, across diverse benchmarks and find it outperforms both proprietary and open-source baselines, particularly on complex tasks requiring visual grounding and long-horizon reasoning. Results show consistent gains after reinforcement learning, with improved efficiency in tool usage and stronger performance on challenging multimodal benchmarks like MM-BrowseComp and LiveVQA. The model also demonstrates robust transferability, maintaining strong results on text-only tasks despite being optimized for multimodal inputs.

The authors use a 30B-parameter model with context management to achieve state-of-the-art performance among open-source agents, outperforming larger proprietary models on key benchmarks including GAIA. Results show that their approach delivers superior deep research capabilities through efficient tool use and multimodal reasoning, even when compared to significantly larger baselines. Reinforcement learning further enhances performance by refining search efficiency and reducing redundant tool calls without sacrificing accuracy.

Results also show that progressive mid-training stages and reinforcement learning significantly improve long-horizon search efficiency, reducing tool calls while maintaining or increasing accuracy. The model also demonstrates strong multimodal search capabilities, effectively integrating visual and textual evidence across diverse benchmarks.

The authors evaluate REDSearcher on multiple challenging benchmarks, including BrowseComp, GAIA, and HLE, comparing it against both open-source and proprietary models. Results show that REDSearcher achieves competitive or superior scores relative to larger proprietary systems, particularly excelling on GAIA, which tests complex agentic reasoning. The authors attribute this performance to the training methodology, including context management and reinforcement learning, which improve efficiency and long-horizon task execution.

