HyperAI

REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents

Abstract

Large language models are transitioning from general-purpose knowledge engines into solvers of real-world problems, yet optimizing them for deep search tasks remains a major challenge. The central bottleneck is the extreme scarcity of high-quality search trajectories and reward signals, which stems from the difficulty of constructing scalable long-horizon tasks and from the high cost of rollouts that depend on intensive interaction and external tool calls. To address these challenges, we propose a unified framework, REDSearcher, that co-designs complex task synthesis, mid-training, and post-training to optimize search agents efficiently at scale. Specifically, REDSearcher introduces the following improvements: (1) We formulate task generation as a dual-constraint optimization problem, in which task difficulty is precisely controlled through graph structure and evidence distribution, enabling scalable creation of complex, high-quality tasks. (2) We introduce tool-augmented queries to incentivize interactive tool use rather than passive recall. (3) During mid-training, we strengthen core capabilities such as knowledge, planning, and function calling, substantially reducing the cost of collecting high-quality trajectories for post-training. (4) We build a local simulated environment that enables fast, low-cost algorithmic iteration in reinforcement learning experiments. Our approach achieves state-of-the-art performance on both text-only and multimodal search-agent benchmarks. To support future research on long-horizon search agents, we will release 10,000 high-quality complex text search trajectories, 5,000 multimodal trajectories, and 1,000 text RL query sets, along with code and model checkpoints.

One-sentence Summary

The REDSearcher team proposes a unified framework for optimizing search agents by co-designing task synthesis, mid-training, and post-training, using graph-constrained task generation and tool-augmented queries to reduce reliance on costly real-world rollouts, achieving SOTA across text and multimodal benchmarks while releasing 15K trajectories, 1K RL queries, and code.

Key Contributions

  • REDSearcher addresses the scarcity of high-quality search trajectories by synthesizing complex tasks via dual constraints—graph treewidth for logical complexity and evidence dispersion—enabling scalable generation of long-horizon reasoning problems that demand iterative planning and cross-document synthesis.
  • It introduces tool-augmented queries and mid-training reinforcement of core capabilities (knowledge, planning, function calling) to promote proactive tool use and reduce the cost of collecting high-quality trajectories, while a local simulated environment enables rapid, low-cost RL experimentation.
  • Evaluated on text-only and multimodal benchmarks, REDSearcher achieves state-of-the-art performance and releases 10K text, 5K multimodal search trajectories, and 1K RL queries to support future research on deep search agents.
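The treewidth constraint above can be made concrete: a chain-shaped reasoning graph (purely linear multi-hop reasoning) has treewidth 1, while cycles and interlocking constraints push it higher. A minimal stdlib-only sketch of the min-degree elimination heuristic (which yields an upper bound on the true treewidth; this is an illustration, not the paper's code) shows the difference:

```python
from itertools import combinations

def treewidth_upper_bound(edges):
    """Min-degree elimination heuristic: returns an upper bound on treewidth."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    width = 0
    while adj:
        # Eliminate the vertex of minimum degree.
        v = min(adj, key=lambda x: len(adj[x]))
        nbrs = adj[v]
        width = max(width, len(nbrs))
        # Add fill-in edges between all neighbors of v, then remove v.
        for a, b in combinations(nbrs, 2):
            adj[a].add(b)
            adj[b].add(a)
        for u in nbrs:
            adj[u].discard(v)
        del adj[v]
    return width

# A chain (linear reasoning) has treewidth 1; a 4-cycle has treewidth 2.
print(treewidth_upper_bound([(1, 2), (2, 3), (3, 4)]))          # 1
print(treewidth_upper_bound([(1, 2), (2, 3), (3, 4), (4, 1)]))  # 2
```

Higher treewidth forces a solver to keep several partial hypotheses alive simultaneously, which is exactly the property the synthesis pipeline uses to tune difficulty.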

Introduction

The authors leverage large language models to tackle long-horizon search tasks—where agents must plan, retrieve, and synthesize information across multiple steps and sources—but note that prior work struggles with sparse high-quality training data and prohibitive costs from live tool interactions. Existing datasets often lack structural complexity and rely on simplistic, linear reasoning, while real-world search demands handling cyclic or fully coupled constraints that require maintaining entangled hypotheses. REDSearcher addresses this by co-designing task synthesis, mid-training, and reinforcement learning: it generates complex tasks using treewidth-guided graph topology and evidence dispersion, injects tool-augmented queries to promote proactive tool use, strengthens core subskills early to reduce rollout costs, and deploys a simulated environment for rapid RL iteration. The result is a scalable, cost-efficient framework that achieves state-of-the-art performance on both text and multimodal search benchmarks, backed by public releases of 16K high-quality trajectories and training artifacts.

Dataset

  • The authors construct a synthetic dataset to train deep search agents capable of handling multi-hop, ambiguous, and non-linear queries—tasks that demand iterative tool use and evidence synthesis, which existing open-source datasets lack.

  • The dataset is generated via a scalable, controllable synthesis pipeline that combines signals from local knowledge bases and cached webpages, intentionally increasing difficulty through fuzzing and complex constraints.

  • To ensure quality and challenge, the authors apply a five-stage verifier pipeline:

    1. LLM solver pre-filter: removes instances solvable without tools.
    2. Retrievability check: filters out questions whose answers don’t appear in top-50 search snippets.
    3. Hallucination/inconsistency check: uses an LLM verifier to detect contradictions between evidence and question-answer pairs.
    4. Agent rollout verification: runs strong tool-using agents across multiple rollouts; keeps instances where at least one rollout succeeds and records pass rate as confidence.
    5. Answer uniqueness check: discards instances with plausible alternative answers to reduce ambiguity.
  • A quality study confirms 85%+ of 500 human-verified instances are logically consistent and grounded; a strong model (DeepSeek-V3.2) achieves ~40% accuracy, while humans solve 47% within 30 minutes—validating the dataset’s realistic difficulty.

  • For training, the authors generate multi-turn tool-calling data simulating ReAct loops, using LLMs to create tool sets, queries, and environmental feedback—avoiding costly real API calls.

  • Long-horizon interaction trajectories (up to 128K context) are synthesized using a local simulated web environment built from Wikipedia and web crawl dumps, ensuring solvability and enabling training on complex, multi-step search tasks.

  • The dataset includes highly intricate, real-world-inspired questions requiring cross-domain reasoning, such as identifying record pressing plants, healthcare facilities, racing events, and historical sites—all grounded in synthesized but plausible evidence.

  • No cropping is applied; metadata is constructed implicitly through the synthesis pipeline, embedding grounding signals (e.g., KB triples, cached passages) and confidence metrics (e.g., agent pass rates) for each instance.
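The five-stage verifier pipeline described above can be sketched as a staged filter. This is a hedged illustration only: the function and argument names are ours, and the boolean inputs stand in for the actual LLM and agent checks.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    question: str
    answer: str
    confidence: float = 0.0  # filled in by the rollout stage

def run_verifier(inst, *, solved_without_tools, in_top50_snippets,
                 evidence_consistent, rollout_successes, n_rollouts,
                 has_unique_answer):
    """Apply the five verifier stages in order; return the instance if it
    survives all of them, else None."""
    if solved_without_tools:        # 1. LLM solver pre-filter
        return None
    if not in_top50_snippets:       # 2. Retrievability check
        return None
    if not evidence_consistent:     # 3. Hallucination/inconsistency check
        return None
    if rollout_successes == 0:      # 4. Agent rollout verification
        return None
    inst.confidence = rollout_successes / n_rollouts  # pass rate as confidence
    if not has_unique_answer:       # 5. Answer uniqueness check
        return None
    return inst

kept = run_verifier(Instance("Which plant pressed the record?", "X Records"),
                    solved_without_tools=False, in_top50_snippets=True,
                    evidence_consistent=True, rollout_successes=2,
                    n_rollouts=4, has_unique_answer=True)
print(kept.confidence)  # 0.5
```

Ordering matters for cost: the cheap LLM-based checks run first, so expensive multi-rollout agent verification only touches instances that already passed the earlier filters.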

Method

The authors leverage a structured, multi-phase training framework to develop REDSearcher, a tool-augmented agent capable of deep, long-horizon search across text and multimodal domains. The architecture is built upon a scalable task synthesis pipeline, a two-stage mid-training regimen, and a post-training phase that combines supervised fine-tuning with reinforcement learning. Each component is designed to address the sparsity of supervision and the computational cost of real-world interaction.

The core of the method begins with the scalable task synthesis pipeline, which generates complex, verifiable QA pairs by constructing reasoning graphs with controlled structural and distributional complexity. As shown in the figure below, the pipeline initiates with a seed entity set drawn from Wikipedia, from which a directed acyclic graph (DAG) is built using both structured Wikidata relations and web-based hyperlink traversal. This graph is then enriched by an LLM-driven agent to introduce cycles and interlocking constraints, increasing the treewidth and forcing the solver to maintain multiple hypotheses. Subgraph sampling extracts multiple reasoning contexts from each master graph, and an LLM generates natural-language questions anchored to these topologies. A critical innovation is the tool-enforced query evolution: static entities are replaced with operational constraints (e.g., routing queries or citation-based lookups) that require external tool invocation, ensuring that successful completion is contingent on tool use.
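The subgraph-sampling step above can be sketched as a BFS-style extraction of a connected reasoning context from the master graph. This is an illustrative stand-in under our own assumptions, not the authors' sampler:

```python
import random
from collections import deque

def sample_subgraph(adj, size, rng=random):
    """Sample a connected subgraph of `size` nodes from a master graph
    (adjacency dict: node -> set of neighbors) via BFS from a random anchor."""
    start = rng.choice(list(adj))
    seen, queue = {start}, deque([start])
    while queue and len(seen) < size:
        node = queue.popleft()
        for nbr in sorted(adj[node]):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
                if len(seen) == size:
                    break
    # Restrict edges to the sampled node set.
    return {n: adj[n] & seen for n in seen}

rng = random.Random(0)
g = {1: {2, 3}, 2: {1, 4}, 3: {1}, 4: {2}}
sub = sample_subgraph(g, 3, rng)
print(len(sub))  # 3
```

Because many subgraphs are drawn from each master graph, the (relatively expensive) LLM-driven graph enrichment is amortized across many generated questions.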

To ensure quality and difficulty, a verifier pipeline filters out instances that are trivially solvable without tools or otherwise flawed. An LLM solver checks for hallucinations and retrievability via search APIs, while an agent solver performs n rollouts to validate answer consistency. Only QA pairs that survive this multi-stage filtering are retained for training. For multimodal tasks, the pipeline injects visual constraints by anchoring intermediate nodes to images and enforcing cross-modal dependencies, ensuring that visual understanding is necessary for task completion.

The training process is divided into two major phases: mid-training and post-training. As illustrated in the framework diagram, the model begins from an open-source LLM checkpoint and progresses through atomic capability acquisition (Stage 1, 32K context) and composite capability development (Stage 2, 128K context) during mid-training. This is followed by agentic supervised fine-tuning (SFT) and reinforcement learning (RL) in the post-training phase.

Mid-training is further decomposed into two phases. Phase I focuses on internal cognitive optimization: intent-anchored grounding, which teaches the model to extract relevant facts from noisy web pages under specific query intents, and hierarchical planning, which enables decomposition of ambiguous goals into concrete subtasks. Phase II introduces external environmental interaction, where the model learns to execute tool calls and maintain state across long-horizon trajectories. This staged approach allows the model to warm-start with foundational skills before engaging in costly real-world rollouts.

In post-training, the model undergoes supervised fine-tuning on high-quality ReAct-style trajectories generated in real-world environments using five tool interfaces: search, visit, Python interpreter, Google Scholar, and Google Maps. The SFT objective masks environment observations to prevent gradient contamination. Subsequently, agentic reinforcement learning is applied using GRPO with verifiable rewards. The reward is binary (0/1) based on answer correctness, and advantages are normalized within groups of rollouts per question to stabilize training. To accelerate experimentation, a functionally equivalent simulation environment is used during RL, which mimics real APIs while ensuring evidence completeness and injecting realistic noise. The simulation environment is built from cached web data and includes URL obfuscation to prevent model bias. Asynchronous rollouts and a two-tier load balancing strategy are employed to handle the computational demands of long trajectories.
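The GRPO reward shaping described above—binary correctness rewards normalized within each question's group of rollouts—can be sketched as follows. This is a minimal illustration; the `eps` term and the exact mean/std normalization are our assumptions about a standard GRPO setup.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize binary rewards within the group
    of rollouts sampled for the same question, as in GRPO."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# 4 rollouts for one question; two answered correctly (reward 1), two did not.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
print(advs)  # approximately [1.0, -1.0, 1.0, -1.0]
```

Normalizing within the group means correct rollouts are rewarded relative to their siblings on the same question, which stabilizes training when question difficulty varies widely.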

The entire framework is designed to scale efficiently: task synthesis reuses graphs to amortize LLM costs, mid-training avoids real-world interaction until necessary, and RL leverages a curated, agent-verified query set to ensure clean learning signals. The result is a deep-search agent that can iteratively acquire evidence, maintain hypotheses, and synthesize information across multiple sources and modalities.

Experiment

  • REDSearcher sets a new state-of-the-art among open-source 30B-parameter agents, outperforming both open and proprietary models on complex benchmarks like GAIA, demonstrating superior parameter efficiency and deep research capability.
  • Mid-training stages progressively enhance performance: Stage I improves grounding and planning, especially on GAIA; Stage II enables robust tool use and long-horizon execution, significantly boosting performance on BrowseComp-ZH.
  • Reinforcement learning further refines capabilities, improving overall scores and reducing tool usage by 10.4% without sacrificing accuracy, indicating more efficient and strategic search behavior.
  • Tool-use analysis reveals REDSearcher relies minimally on parametric knowledge, excelling only when tools are enabled—highlighting strong planning, evidence synthesis, and iterative reasoning over memorization.
  • Multimodal experiments show strong performance across vision-language benchmarks, outperforming large proprietary models and a Qwen3-VL baseline, with capabilities transferring well to text-only tasks.
  • Analysis of tool usage patterns shows adaptive behavior: simpler tasks require fewer turns, while complex ones involve more decomposition, reflection, and verification; RL training reduces unnecessary search steps, especially on easier benchmarks.

The authors use a staged mid-training approach to progressively enhance the model’s agentic capabilities, with each stage building upon the last to improve performance across multiple benchmarks. Results show consistent gains in average scores as the model advances through grounding, planning, and agentic interaction phases, particularly on complex tasks like GAIA and BrowseComp-ZH. This structured training strategy effectively bridges the gap between understanding and action, enabling more robust and goal-consistent behavior in deep search scenarios.

The authors evaluate their multimodal search agent, REDSearcher-MM, across diverse benchmarks and find it outperforms both proprietary and open-source baselines, particularly on complex tasks requiring visual grounding and long-horizon reasoning. Results show consistent gains after reinforcement learning, with improved efficiency in tool usage and stronger performance on challenging multimodal benchmarks like MM-BrowseComp and LiveVQA. The model also demonstrates robust transferability, maintaining strong results on text-only tasks despite being optimized for multimodal inputs.

The authors use a 30B-parameter model with context management to achieve state-of-the-art performance among open-source agents, outperforming larger proprietary models on key benchmarks including GAIA. Progressive mid-training stages and reinforcement learning significantly improve long-horizon search efficiency, reducing redundant tool calls while maintaining or increasing accuracy. The model also demonstrates strong multimodal search capabilities, effectively integrating visual and textual evidence across diverse benchmarks.

The authors use REDSearcher to evaluate performance across multiple challenging benchmarks, including BrowseComp, GAIA, and HLE, comparing it against both open-source and proprietary models. Results show that REDSearcher achieves competitive or superior scores relative to larger proprietary systems, particularly excelling on GAIA, which tests complex agentic reasoning. The model’s strong performance is attributed to its architecture and training methodology, including context management and reinforcement learning, which enhance efficiency and long-horizon task execution.

