SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

Pu Ning, Quan Chen, Kun Tao, Xinyu Tang, Tianshu Wang, Qianggang Cao, Xinyu Kong, Zujie Wen, Zhiqiang Zhang, Jun Zhou

Abstract

Large language models are increasingly expected to handle complex, long-horizon real-world tasks whose context demands can grow without bound, yet model context windows remain inherently finite. Recent work explores a paradigm where a main agent decomposes tasks and dispatches subtasks to subagents, which execute and return only summarized results, conserving the main agent's context budget. However, performing this well requires delegation intelligence: the ability to decompose complex tasks, determine when and what to delegate, and integrate returned results into the ongoing workflow. Training data for this capability is scarce in naturally occurring text, and to our knowledge, how to synthesize such data and train models to acquire this capability remains largely unexplored in the open-source community. To bridge this gap, we present a preliminary exploration targeting deep research, a representative long-horizon agent task. Specifically, we design a harness that guides the model toward high-quality task decomposition and delegation, while constraining subagents to return results properly to support the main agent's workflow. The harness-guided trajectories naturally encode correct delegation decisions, which we use as supervised fine-tuning data to internalize delegation intelligence into model weights. Our resulting model, SearchSwarm-30B-A3B, achieves 68.1 on BrowseComp and 73.3 on BrowseComp-ZH, the best results among all models of comparable scale. We will release our harness, model weights, and training data to facilitate future research.

One-sentence Summary

SearchSwarm-30B-A3B is a model trained via supervised fine-tuning on harness-generated trajectories to internalize delegation intelligence for long-horizon deep research, achieving 68.1 on BrowseComp and 73.3 on BrowseComp-ZH, the best results among models of comparable scale.

Key Contributions

  • A specialized execution harness structures multi-agent workflows by guiding task decomposition, subagent briefing, and citation-grounded result integration while constraining subagents to return only summarized outputs. This architecture shields the main agent from raw tool responses, effectively preserving finite context capacity for iterative exploration.
  • Harness-generated trajectories are extracted and formatted into supervised fine-tuning data to internalize delegation intelligence directly into model weights. This data synthesis pipeline addresses the scarcity of naturally occurring delegation examples in open-source training corpora.
  • The resulting SearchSwarm-30B-A3B model achieves state-of-the-art performance among comparable-scale models on BrowseComp and BrowseComp-ZH. Evaluation results further demonstrate that the trained delegation patterns generalize effectively to single-agent settings and open-ended research tasks.

Introduction

Large language models are increasingly deployed as autonomous agents for complex, long-horizon tasks like deep research, where information demands quickly outpace finite context windows. This bottleneck makes efficient context management essential for maintaining model performance and scalability. Active delegation architectures offer a promising alternative to passive summarization techniques, but the open-source community lacks a complete training recipe, and naturally occurring text rarely contains the explicit multi-agent coordination data required to teach delegation intelligence. To bridge this gap, the authors use a custom inference harness to guide a main agent through structured task decomposition and detailed subagent briefing, then convert the successful trajectories into supervised fine-tuning data. This process internalizes delegation intelligence directly into model weights, producing SearchSwarm-30B-A3B, which achieves state-of-the-art results among similarly sized models. The authors state that they will release the harness, training data, and model weights to support future research.

Dataset

  • Dataset Composition and Sources: The authors construct the training corpus by executing deep research tasks on queries sourced from the open-source RedSearcher and OpenSeeker datasets. They record complete execution trajectories that capture chain-of-thought reasoning, tool invocations, and environment feedback.
  • Subset Details and Filtering Rules: Data collection follows two configurations. The first runs a single model as both main agent and subagent, preserving trajectories from both roles. The second pairs a stronger main agent with a weaker subagent, retaining only the main agent trajectories to encourage tighter task decomposition and verification. Filtering keeps a main agent trajectory only when it yields a correct final answer and retains subagent trajectories exclusively when they are paired with a correct main trajectory. The authors also downsample overly short subagent trajectories and discard samples featuring repeated tool calls, hallucinated citations, or tool misuse such as web scraping through the Python interpreter.
  • Training Usage and Processing: Trajectories from both configurations are mixed into a single training set. The authors fine-tune the base model using next-token prediction with strict environment masking. The loss function is computed solely over the model's generated outputs, while all environment returns are masked to prevent the model from memorizing external feedback.
  • Context Management and Cropping Strategy: The main agent context window is capped at 128K tokens and the subagent window at 64K tokens. When a trajectory nears these limits, the system prompts the model to generate a final answer immediately. Rather than dropping these sequences, the authors preserve them so the model learns to perform well under forced-answer conditions during inference. Additionally, subagent dispatches are carefully crafted to include only established context, ensuring they focus on specific sub-questions without repeating settled ground.
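The filtering, masking, and forced-answer rules above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Turn` and `Trajectory` types, the `has_defect` flag, the exact token counts behind "128K" and "64K", and the `margin` used to decide when a trajectory "nears" its limit are all assumptions introduced here. Downsampling of short subagent trajectories is omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss

# Assumed budgets: the text says 128K / 64K tokens without giving exact counts.
CONTEXT_CAP = {"main": 128_000, "sub": 64_000}


@dataclass
class Turn:
    role: str             # "model" = generated by the agent, "env" = tool/environment return
    token_ids: List[int]


@dataclass
class Trajectory:
    agent: str                    # "main" or "sub"
    turns: List[Turn]
    answer_correct: bool = False  # only meaningful for main-agent trajectories
    has_defect: bool = False      # hypothetical flag: repeated calls, bad citations, tool misuse


def keep_trajectories(main: Trajectory, subs: List[Trajectory],
                      keep_subs: bool = True) -> List[Trajectory]:
    """Keep nothing unless the main answer is correct; then drop defective samples.

    keep_subs=False mimics the stronger-main / weaker-sub configuration,
    where only main-agent trajectories are retained.
    """
    if not main.answer_correct:
        return []
    candidates = [main] + (subs if keep_subs else [])
    return [t for t in candidates if not t.has_defect]


def build_labels(traj: Trajectory) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) with environment tokens masked out of the loss."""
    input_ids: List[int] = []
    labels: List[int] = []
    for turn in traj.turns:
        input_ids.extend(turn.token_ids)
        if turn.role == "model":
            labels.extend(turn.token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(turn.token_ids))
    return input_ids, labels


def should_force_answer(agent: str, used_tokens: int, margin: int = 2_000) -> bool:
    """True once the context is within `margin` tokens of the agent's cap (margin is assumed)."""
    return used_tokens >= CONTEXT_CAP[agent] - margin
```

The labels here are aligned one-to-one with the inputs; the usual one-position shift for next-token prediction is assumed to happen inside the training loss, as it does in most causal language model trainers.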

Method

The SearchSwarm framework operates under a main-distributes, sub-executes paradigm, where a central main agent orchestrates complex research tasks by delegating subtasks to independent subagents. This architecture is designed to manage context efficiently and enable high-quality reasoning through structured delegation. The main agent, equipped with a comprehensive tool set including search, visit, Python interpreter, and Google Scholar, interacts with the environment through a sequence of thoughts, actions, and observations, following the ReAct framework. At each step, the agent reasons about the current state, selects an action, and processes the resulting observation. When a subtask is identified, the main agent invokes the call_sub_agent tool, which dispatches a brief to a subagent. The brief contains a subtask description along with contextual information such as the task's relevance, prior findings, and unresolved questions, ensuring the subagent operates with sufficient background to contribute effectively.
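To make the brief concrete, here is one way it might be represented and packaged as a tool call. The field names, the rendered layout, and the `{"name": ..., "arguments": {"brief": ...}}` payload shape are assumptions for illustration; the source names the `call_sub_agent` tool and the kinds of information a brief carries, but not its schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Brief:
    """Hypothetical container for what the main agent hands to a subagent."""
    subtask: str
    relevance: str = ""
    prior_findings: List[str] = field(default_factory=list)
    open_questions: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Flatten the brief into plain text, skipping empty sections."""
        lines = [f"Subtask: {self.subtask}"]
        if self.relevance:
            lines.append(f"Why it matters: {self.relevance}")
        if self.prior_findings:
            lines.append("Already established:")
            lines += [f"- {finding}" for finding in self.prior_findings]
        if self.open_questions:
            lines.append("Unresolved questions:")
            lines += [f"- {question}" for question in self.open_questions]
        return "\n".join(lines)


def make_tool_call(brief: Brief) -> Dict[str, object]:
    """Wrap a brief in an assumed tool-call payload for call_sub_agent."""
    return {"name": "call_sub_agent", "arguments": {"brief": brief.render()}}
```

Listing prior findings separately from unresolved questions gives the subagent enough background to start from the current state of the investigation rather than from the original query.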

As shown in the figure above, the main agent and subagents operate in separate contexts, with the subagent receiving only the brief and returning a condensed report. This separation ensures that the main agent’s context remains uncluttered, preserving its capacity for high-level coordination and judgment. The subagent, equipped with the same set of tools as the main agent, conducts its own multi-turn interactions to gather evidence and produce a report. The report is required to include inline citations for every significant claim, allowing the main agent to verify the reliability of the findings without access to the subagent’s intermediate steps. The main agent then integrates the report into its reasoning process, continuing the iterative cycle of thought and action until a final answer is generated. This approach enables the system to handle long-horizon tasks by effectively compressing subtask execution into a single report, thereby managing context growth while maintaining traceability and coherence.
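The context separation and citation requirement can be sketched as below. This is a simplified illustration under stated assumptions: contexts are modeled as lists of strings, `run_subagent` is a hypothetical callable standing in for a full multi-turn subagent run, and citations are assumed to use a bracketed-number form such as `[1]` placed before the sentence's closing punctuation. The source requires inline citations but does not specify their format.

```python
import re
from typing import Callable, List

CITATION = re.compile(r"\[\d+\]")  # assumed citation marker, e.g. "[3]"


def uncited_sentences(report: str) -> List[str]:
    """Return sentences in the report that carry no citation marker."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", report) if s.strip()]
    return [s for s in sentences if not CITATION.search(s)]


def delegate(main_context: List[str], brief: str,
             run_subagent: Callable[[str], str]) -> List[str]:
    """Run a subagent on the brief and append only its report to the main context.

    The subagent's searches, page visits, and other intermediate steps stay
    inside run_subagent; the main context grows by exactly one entry.
    Returns the sentences the main agent may want to double-check.
    """
    report = run_subagent(brief)
    main_context.append(f"[sub_agent report]\n{report}")
    return uncited_sentences(report)
```

In this sketch the uncited sentences are only flagged, not rejected; how the main agent acts on them (re-delegating, verifying directly, or discarding the claim) is left open.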

Experiment

The experiments evaluate a two-agent delegation framework across multiple long-horizon and open-ended research benchmarks, comparing it against leading closed-source, open-source, and lightweight models. Results demonstrate that the proposed harness and training data substantially enhance delegation intelligence, enabling a compact model to match or exceed much larger frontier systems. Ablation studies and cross-architecture tests confirm that the framework effectively elicits structured information gathering and synthesis while proving the high quality of the underlying training data. Furthermore, the acquired capabilities generalize robustly to single-agent configurations and open-ended research tasks, highlighting the method's versatility and the model's internalized problem-decomposition skills.

SearchSwarm achieves state-of-the-art performance among lightweight models on long-horizon research tasks and is competitive with larger models across multiple benchmarks. Its delegation mechanism supports efficient context management, with the main agent primarily orchestrating subagent calls for information gathering. The training data and harness design also generalize beyond the delegation setting, improving performance even when no delegation tools are available.

The authors compare SearchSwarm against closed-source, open-source, and lightweight open-source models across multiple benchmarks. SearchSwarm achieves the top performance among models of its scale and surpasses several larger models on key benchmarks, particularly on long-horizon research tasks. It also generalizes to open-ended deep research settings, improving significantly over its base model without explicit training on such tasks. The main agent relies heavily on delegation, using the subagent tool for information gathering while handling verification and computation directly.

On open-ended deep research benchmarks, the authors compare SearchSwarm against both closed-source and open-source systems. SearchSwarm achieves the second-highest average performance among open-source models and top performance on ResearchQA and ScholarQA-v2, outperforming several strong open-source models. It also significantly outperforms its base model on every evaluated benchmark, demonstrating strong generalization to long-form synthesis tasks.

Across the evaluated benchmarks, SearchSwarm achieves state-of-the-art performance among models at the 30B-A3B scale and competes with significantly larger models, indicating that delegation intelligence enables strong performance on long-horizon research tasks despite the model's size. The training data and harness design promote effective delegation and generalize to both single-agent and open-ended research settings.

The authors evaluate SearchSwarm across multiple long-horizon and open-ended research benchmarks, comparing it against closed-source, open-source, and similarly sized models to validate its competitive efficiency and generalization capabilities. Results indicate that the model achieves state-of-the-art performance within its parameter scale while remaining highly competitive with significantly larger systems, demonstrating that effective delegation intelligence can offset size limitations. The experiments further validate that the delegation mechanism successfully manages context by orchestrating subagent information retrieval, which consistently drives improvements over the base architecture. Additionally, the tailored training data and harness design prove effective at promoting intelligent delegation, enabling robust generalization to both single-agent and open-ended research settings without explicit task-specific training.

