AOrchestra: Automating Sub-Agent Creation for Agentic Orchestration
Abstract
Language agents have shown strong promise for task automation. Realizing this promise on increasingly complex, long-horizon tasks has driven the rise of a sub-agents-as-tools paradigm for multi-turn task solving. However, existing designs still lack a dynamic, abstraction-level view of sub-agents, which hurts adaptability. We address this challenge with a unified, framework-agnostic agent abstraction that models any agent as a tuple ⟨Instruction, Context, Tools, Model⟩. This tuple acts as a compositional recipe for capabilities, enabling the system to spawn specialized executors for each task on demand. Building on this abstraction, we introduce AOrchestra, an agentic system in which a central orchestrator concretizes the tuple at each step: it curates task-relevant context, selects tools and models, and delegates execution via on-the-fly automatic agent creation. This design reduces human engineering effort and remains framework-agnostic, with plug-and-play support for diverse agents as task executors. It also enables a controllable performance-cost trade-off, allowing the system to approach Pareto efficiency. Across three challenging benchmarks (GAIA, SWE-Bench, Terminal-Bench), AOrchestra achieves a 16.28% relative improvement over the strongest baseline when paired with Gemini-3-Flash. The code is available at: https://github.com/FoundationAgents/AOrchestra
One-sentence Summary
Researchers from multiple institutions propose AOrchestra, a framework-agnostic system that uses dynamic agent tuples ⟨Instruction, Context, Tools, Model⟩ to auto-generate specialized executors, reducing human effort and enabling Pareto-efficient performance-cost trade-offs, and outperforming the strongest baseline by 16.28% (relative) across GAIA, SWE-Bench, and Terminal-Bench.
Key Contributions
- AOrchestra introduces a unified, framework-agnostic agent abstraction as a 4-tuple ⟨Instruction, Context, Tools, Model⟩, enabling dynamic, on-demand creation of specialized sub-agents tailored to each subtask's requirements.
- The system's central orchestrator automatically concretizes this tuple at each step by curating context, selecting tools and models, and spawning executors, reducing human engineering and supporting plug-and-play integration of diverse agents.
- Evaluated on GAIA, SWE-Bench, and Terminal-Bench with Gemini-3-Flash, AOrchestra achieves a 16.28% relative improvement over the strongest baseline, demonstrating superior performance on complex, long-horizon tasks while enabling controllable cost-performance trade-offs.
Introduction
The authors leverage a dynamic, framework-agnostic agent abstraction, defined as a 4-tuple ⟨Instruction, Context, Tools, Model⟩, to enable on-demand creation of specialized sub-agents for complex, long-horizon tasks. Prior systems treated sub-agents either as static roles (requiring heavy engineering and lacking adaptability) or as isolated context threads (ignoring capability specialization), limiting performance in open-ended environments. AOrchestra's orchestrator automatically concretizes this tuple at each step, selecting tools, models, and context to spawn tailored executors, reducing human effort while enabling cost-performance trade-offs. Evaluated on GAIA, SWE-Bench, and Terminal-Bench, it delivers a 16.28% relative improvement over baselines when paired with Gemini-3-Flash, and supports learning-based optimization for further efficiency gains.
Dataset

- The authors evaluate on three benchmarks: GAIA, Terminal-Bench 2.0, and SWE-Bench-Verified. GAIA contains 165 tasks testing tool-augmented, multi-step reasoning; Terminal-Bench 2.0 includes 89 real-world command-line workflows (70 sampled for cost); SWE-Bench-Verified has 500 software-engineering tasks (100 sampled), all verified via test execution and human screening.
- Each task is structured with an instruction template including repository and instance-ID fields. The agent system breaks complex queries down (e.g., identifying zodiac figures in a 2015 Met exhibition) into delegated subtasks with detailed instructions, context, and model assignments.
- The system uses a recursive delegation strategy: initial attempts identify the exhibition and accession numbers; later attempts refine object identification and hand-visibility checks; a final decision synthesizes the findings into a single answer (e.g., "11" visible hands).
- Third-party APIs (web search, sandboxed environments) are accessed only via a controlled tool interface. Responses are cached with metadata (URLs, timestamps, queries) to ensure reproducibility and consistent evaluation.
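As a rough illustration of such a controlled, cached tool interface (all names below are hypothetical, not from the AOrchestra codebase), one could wrap each third-party call like this:

```python
import hashlib
import json
import time


class CachedToolProxy:
    """Illustrative wrapper: routes third-party API calls through a single
    controlled interface and caches responses with provenance metadata.
    The class and field names here are assumptions for this sketch."""

    def __init__(self, tool_fn, tool_name):
        self.tool_fn = tool_fn      # e.g., a web-search callable
        self.tool_name = tool_name
        self._cache = {}            # cache key -> stored record

    def __call__(self, **kwargs):
        # Deterministic cache key from tool name + sorted arguments.
        key = hashlib.sha256(
            json.dumps({"tool": self.tool_name, "args": kwargs},
                       sort_keys=True).encode()
        ).hexdigest()
        if key not in self._cache:
            result = self.tool_fn(**kwargs)
            # Store the response alongside reproducibility metadata.
            self._cache[key] = {
                "tool": self.tool_name,
                "query": kwargs,
                "timestamp": time.time(),
                "result": result,
            }
        return self._cache[key]["result"]
```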
Method
The authors leverage a unified, orchestrator-centric architecture called AOrchestra to solve complex, long-horizon agentic tasks. The core innovation lies in treating sub-agents not as static roles or isolated threads, but as dynamically creatable executors parameterized by a four-tuple interface $\langle I, C, T, M \rangle$, where $I$ is the task instruction, $C$ the curated context, $T$ the tool set, and $M$ the underlying model. This abstraction decouples working memory (instruction and context) from capability (tools and model), enabling on-demand specialization for each subtask.
Refer to the framework diagram, which illustrates the orchestrator's iterative process. The system begins with a user's task, which the Orchestrator Agent decomposes into subtasks. At each step $t$, the Orchestrator samples an action $a_t$ from its restricted action space $\mathcal{A}_{\text{AOrchestra}} = \{\text{Delegate}(\Phi_t), \text{Finish}(y)\}$. If the action is Delegate, the Orchestrator instantiates a dynamic sub-agent with the specified four-tuple $\Phi_t = (I_t, C_t, T_t, M_t)$. This sub-agent executes the subtask using model $M_t$, restricted to tool set $T_t$, and conditioned only on $(I_t, C_t)$. Upon completion, it returns a structured observation $o_t$ (typically a result summary, artifacts, and error logs), which is integrated into the next system state $s_{t+1}$ via the state-transition function $\delta(s_t, a_t, o_t)$. The process repeats until the Orchestrator selects Finish, outputting the final answer $y$.
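To make this control flow concrete, here is a minimal Python sketch of the Delegate/Finish loop, under the assumption of a hypothetical `SubAgentSpec` container and `run_subagent` executor (these names are illustrative, not the paper's API):

```python
from dataclasses import dataclass


@dataclass
class SubAgentSpec:
    """The four-tuple Phi_t = (I_t, C_t, T_t, M_t) parameterizing a sub-agent."""
    instruction: str   # I_t: what the sub-agent should do
    context: str       # C_t: curated, task-relevant history
    tools: list        # T_t: minimal tool subset for this subtask
    model: str         # M_t: backbone matched to subtask difficulty


def run_subagent(spec: SubAgentSpec) -> dict:
    """Hypothetical executor: runs the sub-agent and returns a structured
    observation o_t (result summary, artifacts, error logs)."""
    raise NotImplementedError  # backed by ReAct, mini-SWE, etc.


def orchestrate(task: str, policy, max_steps: int = 20) -> str:
    """Delegate/Finish loop: the orchestrator never acts on the environment
    itself; it only spawns sub-agents and integrates their observations."""
    state = {"task": task, "observations": []}
    for _ in range(max_steps):
        action = policy(state)                # pi_theta(a_t | s_t)
        if action["type"] == "finish":
            return action["answer"]           # Finish(y)
        obs = run_subagent(action["spec"])    # Delegate(Phi_t)
        state["observations"].append(obs)     # s_{t+1} = delta(s_t, a_t, o_t)
    return "max steps exceeded"
```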

The Orchestrator’s design explicitly separates orchestration from execution: it never directly performs environment actions. Instead, it delegates all execution to dynamically instantiated sub-agents. This enables fine-grained control over context, tools, and model selection per subtask. For instance, the Orchestrator can curate context by filtering the full history to provide only the most relevant details, configure tools by selecting a minimal required subset from a pool, and match model capabilities to subtask demands—such as selecting a cheaper model for simple tasks or a more capable one for complex reasoning.
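Continuing the sketch above, a hypothetical `make_policy` shows how the orchestrator might concretize $\Phi_t$ per subtask; the heuristics, the assumed `llm.plan` helper, and the model names are illustrative assumptions, not the paper's actual routing rules:

```python
def make_policy(llm, tool_pool, cheap_model="gemini-3-flash",
                strong_model="gemini-3-pro"):
    """Hypothetical per-subtask concretization of Phi_t."""
    def policy(state):
        plan = llm.plan(state)  # assumed LLM call proposing the next subtask
        if plan.done:
            return {"type": "finish", "answer": plan.answer}
        spec = SubAgentSpec(
            instruction=plan.subtask,
            # Curated context: only observations relevant to this subtask,
            # rather than the orchestrator's full history.
            context="\n".join(o["summary"] for o in state["observations"]
                              if plan.subtask_id in o.get("relevant_to", [])),
            # Minimal tool subset drawn from the shared pool.
            tools=[tool_pool[name] for name in plan.required_tools],
            # Match model capability to estimated subtask difficulty.
            model=strong_model if plan.difficulty == "hard" else cheap_model,
        )
        return {"type": "delegate", "spec": spec}
    return policy
```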
The framework also supports two complementary learning paradigms to improve the Orchestrator's policy $\pi_\theta(a_t \mid s_t)$. First, supervised fine-tuning (SFT) distills expert trajectories to improve task orchestration, enhancing subtask decomposition and the synthesis of $(I_t, C_t, T_t)$. Second, iterative in-context learning optimizes cost-aware orchestration without updating model weights. Here, the Orchestrator's instruction $I^{\text{main}}$ is treated as a learnable object. After rolling out trajectories $\tau_k$, an optimizer analyzes performance and cost metrics to propose prompt edits $\Delta I$, updating the instruction as $I^{\text{main}}_{k+1} = \text{OPTIMIZE}(I^{\text{main}}_k, \tau_k, \text{Perf}(\tau_k), \text{Cost}(\tau_k))$. This loop aims to discover Pareto-efficient trade-offs between performance and cost, such as selecting cheaper models for non-critical subtasks.
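A minimal sketch of this training-free optimization loop, with `rollout`, `perf`, `cost`, and `propose_edit` passed in as caller-supplied callables (these names are assumptions, not the paper's interface):

```python
def optimize_instruction(instruction, tasks, rollout, perf, cost,
                         propose_edit, k_iters=5):
    """Iterative in-context optimization of the orchestrator's main
    instruction I_main; model weights are never updated."""
    for _ in range(k_iters):
        # Roll out trajectories tau_k under the current instruction.
        trajectories = [rollout(instruction, task) for task in tasks]
        p, c = perf(trajectories), cost(trajectories)
        # An optimizer model inspects (tau_k, Perf, Cost) and proposes a
        # prompt edit Delta-I, e.g. "route simple subtasks to a cheaper model".
        instruction = propose_edit(instruction, trajectories, p, c)
    return instruction
```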
The sub-agent abstraction is implementation-agnostic, allowing diverse internal designs, from a simple ReAct loop to a mini-SWE agent, while maintaining a consistent interface for the Orchestrator. This flexibility, combined with explicit capability control and learnable orchestration, enables AOrchestra to achieve strong training-free performance and adapt to cost-performance trade-offs across benchmarks.
Experiment
- AOrchestra consistently outperforms baseline systems across GAIA, Terminal-Bench 2.0, and SWE-Bench-Verified, achieving significant gains in pass@1 and pass@3 metrics, particularly with Gemini-3-Flash as orchestrator.
- Context sharing between orchestrator and sub-agents proves critical: curated, task-relevant context improves performance over no context or full context inheritance, reducing noise and preserving execution fidelity.
- Orchestration is a learnable skill: fine-tuning a smaller model (Qwen3-8B) for task delegation boosts performance, and in-context learning enables cost-aware model routing that improves both accuracy and efficiency.
- Plug-and-play sub-agents demonstrate framework robustness: AOrchestra maintains strong performance when swapping sub-agent backends, confirming modular design and implementation independence.
- The system exhibits strong long-horizon reasoning and error recovery: it successfully resolves complex, multi-attempt tasks by iteratively refining hypotheses, propagating intermediate findings, and adapting based on sub-agent feedback.
The authors evaluate context-sharing strategies in their agent framework by comparing No-Context, Full-Context, and their proposed method. Results show that explicitly curating and passing relevant context to sub-agents significantly improves performance across all difficulty levels, achieving the highest average score. This design avoids the pitfalls of missing critical cues or introducing noise from irrelevant history.
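For concreteness, the three strategies could be sketched as follows (an illustrative assumption about the selection logic, not code from the paper):

```python
def build_context(strategy, history, relevance_fn=None):
    """Compare the three context-sharing strategies for a sub-agent.
    `history` is the orchestrator's list of prior observation summaries;
    `relevance_fn` scores each entry against the current subtask."""
    if strategy == "no_context":
        return ""                      # sub-agent sees only its instruction
    if strategy == "full_context":
        return "\n".join(history)      # inherits everything, noise included
    if strategy == "curated":          # the approach evaluated as best
        # Keep only entries the orchestrator judges task-relevant.
        return "\n".join(h for h in history if relevance_fn(h))
    raise ValueError(f"unknown strategy: {strategy}")
```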

The authors use a hierarchical agent framework where a main orchestrator delegates tasks to specialized sub-agents, achieving higher accuracy than baseline systems across multiple benchmarks. Results show that explicitly curating and passing context to sub-agents improves performance, and that fine-tuning or in-context learning of the orchestrator further enhances both accuracy and cost efficiency. The system also demonstrates robustness by maintaining strong performance even when using smaller models or different sub-agent backends.

The authors use an orchestrator framework that dynamically delegates tasks to specialized sub-agents, achieving higher overall accuracy than standalone baselines across difficulty levels. Results show that combining the orchestrator with different sub-agent styles improves performance, particularly on medium-difficulty tasks, while maintaining strong results on easy tasks. The framework demonstrates flexibility in sub-agent selection, with ReAct-style sub-agents yielding the best overall accuracy.

The authors use AOrchestra, a multi-agent orchestration framework, to outperform baseline systems across three diverse benchmarks: GAIA, Terminal-Bench 2.0, and SWE-Bench-Verified. Results show consistent gains in pass@1 and pass@3 metrics, with the largest improvements observed on Terminal-Bench 2.0 and GAIA, indicating strong generalization across task types and model backbones. The system's design (context-aware sub-agent delegation, learnable orchestration, and plug-and-play execution) contributes to its robust performance and cost-efficiency.
