4달 전

Lisa Alazraki William F. Shen Yoram Bachrach Akhil Mathur

초록

작은 규모의 언어 모델은 에이전트형 AI에 대한 유망하고 비용 효율적인 접근법으로 점점 더 주목받고 있으며, 지지자들은 이러한 모델이 에이전트 워크플로우에 충분히 능력을 갖추고 있다고 주장한다. 그러나 단순한 작업에서는 작은 에이전트가 큰 모델과 거의 동등한 성능을 보일 수 있으나, 작업의 복잡성이 증가할 때 성능이 어떻게 확장되는지, 언제 큰 모델이 필수적인지, 그리고 장기적인 작업 부하에 대해 작은 에이전트를 어떻게 더 효과적으로 활용할 수 있는지에 대해서는 여전히 명확하지 않다. 본 연구에서는 작은 에이전트의 성능이 심층 탐색 및 코딩 작업과 같은 복잡한 과제에서 작업 복잡성에 따라 확장되지 않는다는 경험적 증거를 제시하며, ‘작업 효율을 위한 전략 경매(Strategy Auctions for Workload Efficiency, SALE)’라는 에이전트 프레임워크를 도입한다. SALE은 프리랜서 시장 구조를 영감으로 삼아 개발된 것으로, 에이전트들이 짧은 전략 계획을 제출하여 입찰하고, 체계적인 비용-가치 평가 메커니즘을 통해 평가되며, 공유된 경매 메모리(Shared Auction Memory)를 통해 지속적으로 개선된다. 이 구조는 별도의 라우터 모델을 훈련하지 않고도 각 작업에 맞는 에이전트 라우팅과 지속적인 자기 개선을 가능하게 한다. 다양한 복잡도를 가진 심층 탐색 및 코딩 작업에서 SALE은 가장 큰 에이전트에 대한 의존도를 53% 감소시키고, 전체 비용은 35% 낮추며, 최종 추적 실행 외에 거의 무시할 수 없는 추가 오버헤드로 가장 큰 에이전트보다 높은 pass@1 성능을 지속적으로 달성한다. 반면, 작업 설명에 의존하는 기존의 라우팅 기법들은 대부분 가장 큰 에이전트의 성능을 뛰어넘지 못하거나 비용 절감 효과를 내지 못하며, 종종 두 가지 문제가 동시에 발생한다. 이는 기존 라우터가 에이전트형 워크플로우에 부적합함을 시사한다. 이러한 결과는 작은 에이전트가 복잡한 작업에 충분하지 않을 수 있음에도 불구하고, 조율된 작업 분배와 실행 시 자기 개선을 통해 효과적으로 ‘확장’될 수 있음을 시사한다. 더 넓은 의미에서, 이는 개별 모델의 규모가 점점 더 커지는 것보다는, 다양한 에이전트를 효율적이고 적응 가능한 생태계로 조직하는 마켓 기반의 조율 메커니즘을 통해 성능 향상을 달성하는 시스템 수준의 관점이 에이전트형 AI의 발전 방향임을 시사한다.

One-sentence Summary

Researchers from Meta, Imperial College London, and the University of Cambridge propose SALE, a marketplace-inspired framework where small and large language models bid with strategic plans to handle complex agentic tasks; SALE reduces reliance on the largest model by 53% and overall cost by 35% while improving accuracy through test-time self-improvement and cost-value auctions.

Key Contributions

Small agents perform nearly on par with large ones on simple tasks but degrade significantly as complexity increases in deep search and coding workloads, a scaling gap empirically quantified here for the first time using real-world tasks paired with human solution times via the new HST-BENCH benchmark.
The paper introduces SALE, a marketplace-inspired framework where heterogeneous agents bid with strategic plans scored by cost-value metrics and refined through shared auction memory, enabling dynamic task routing and test-time self-improvement without requiring separate training or full model execution.
SALE reduces reliance on the largest agent by 53% and cuts overall cost by 35% while matching or exceeding its performance on complex tasks, outperforming static routers that fail to adapt or reduce cost, demonstrating that auction-based coordination can effectively “scale up” small agents through adaptive orchestration.

Introduction

The authors leverage small language models as cost-effective agents for complex workflows but find their performance degrades sharply as task complexity increases—particularly in deep search and coding—making them insufficient alone for long-horizon tasks. Prior routing methods either require running all models to completion (too expensive) or rely on static, trained classifiers that fail to adapt or scale with difficulty. Their main contribution is SALE, a marketplace-inspired framework where agents bid with short strategic plans, which are scored and refined via shared auction memory to enable dynamic, test-time routing and self-improvement—without training a separate router. SALE reduces reliance on the largest agent by 53%, cuts overall cost by 35%, and outperforms both single models and existing routers, demonstrating that coordinated heterogeneous agents can deliver better performance-cost trade-offs than scaling individual models alone.

Dataset

The authors use HST-BENCH, a human-timed evaluation dataset of 753 agentic tasks, to measure performance across deep search and coding domains. Here’s how they construct and use it:

Dataset Composition and Sources
HST-BENCH combines existing open-source benchmarks: SimpleQA, PopQA, HotpotQA, GAIA, Humanity’s Last Exam (HLE), MBPP, and LeetCode. They supplement coding with a custom multiple-choice set (Coding-MCQ) to better populate low-complexity bins.
Key Details per Subset
- Tasks are sampled from official test splits; invalid or unanswerable instances are discarded.
- HLE is restricted to expert-validated chemistry/biology questions.
- GAIA’s human solution times are reused from its validation split.
- LeetCode “Hard” tasks use published timing estimates (Siroš et al., 2024) due to annotation cost.
- Coding-MCQ includes short, conceptual questions targeting core programming knowledge (examples in Appendix A.4).
Complexity Binning and Annotation
- Human solution time (τ(t)) is annotated by 3+ expert annotators per task (CS graduates), using permitted tools only (e.g., browser, IDE).
- Times are filtered for correctness and outliers (>2 SD from mean), then averaged.
- Tasks are grouped into 5 non-overlapping bins based on τ(t): 0–0.1 min, 0.1–0.5 min, 0.5–2.5 min, 2.5–12.5 min, 12.5–60 min.
- Bins follow a geometric progression (5× spacing) to balance sample sizes across 3 orders of magnitude in solution time.
- Inter-annotator reliability is high (Krippendorff’s α = 0.86).
Use in Model Evaluation
- The test split contains 753 tasks; separate dev sets (68 for search, 88 for coding) are used for tuning.
- Model performance is analyzed per complexity bin to study how scaling affects agentic capability.
- The authors use Qwen3 models (4B–32B) for controlled scaling experiments, isolating size effects while holding architecture and training constant.
- Evaluation metrics include success rate and cost (token-based), with future extensions planned for tool pricing.
Processing and Metadata
- Metadata includes source dataset, complexity bin, and aggregated human solution time.
- Table 3 (in paper) shows dataset contribution per bin — low bins dominated by SimpleQA/Coding-MCQ, high bins by GAIA, HLE, and LeetCode Hard.
- No cropping or data augmentation is applied; tasks are used in original form with standardized timing protocols.

Method

The authors leverage a novel auction-based framework, SALE, to dynamically route tasks to the most suitable agent within a heterogeneous pool by treating strategic plans as competitive bids. The architecture is structured around four core stages: strategy bidding, cost and value assignment, winning bid selection, and strategy refinement from auction memory. Each stage is designed to optimize the trade-off between computational cost and expected performance, with the entire process governed by learned scoring weights.

In the strategy bidding phase, each agent $a_i$ in the pool $\mathcal{A}$ generates a strategy $s_{t,i}$ conditioned on the task $t$ and environment $E$ . These strategies are interpreted as bids, encoding the agent’s intended approach, including decomposition, tool selection, and anticipated challenges. The authors emphasize that these plans are not merely execution artifacts but serve as informative signals for model selection.

Refer to the framework diagram, which illustrates how multiple agents submit their strategies in parallel for a given task. The strategies are then passed to the cost and value assignment module. Cost $C_{t,i}$ is estimated as a function of the agent’s token price $\pi(a_i)$ , the length of the strategy $|s_{t,i}|$ , and a tunable weight $w_c$ :

C_{t,i} = w_c \cdot \pi(a_i) \cdot |s_{t,i}|.

This formulation leverages strategy length as a proxy for total inference cost and execution risk, grounded in prior work showing that longer plans correlate with higher failure rates and token consumption.

Value $V_{t,i}$ is computed as a weighted combination of intrinsic and extrinsic signals:

V_{t,i} = w_h \cdot H(s_{t,i}) + \sum_{a_j \in \mathcal{A}} w_j \cdot \gamma_j(s_{t,i}),

where $H(s_{t,i})$ is the normalized entropy of the strategy, capturing informational richness, and $\gamma_j(s_{t,i})$ represents peer and self-assessments from a jury of agents. The entropy term is motivated by evidence that higher-entropy reasoning correlates with reduced redundancy and improved planning outcomes. The jury scoring mechanism, which includes both self- and peer-evaluation, is designed to enhance judgment reliability, as supported by literature on LLM juries.

The winning bid selection module employs a min-max optimization to learn scoring weights $w = (w_c, w_h, \{w_j\})$ that minimize the worst-case cost-minus-value $C_{t,i} - V_{t,i}$ across a training set of tasks. This objective ensures robustness by guarding against poor assignments on any single task. At inference time, for each new task $t$ , the system assigns binary variables $x_{t,i} \in \{0,1\}$ to select exactly one agent, minimizing:

z_t = \sum_{a_i \in \mathcal{A}} x_{t,i} \left( C_{t,i} - V_{t,i} \right),

which reduces to selecting the agent with the lowest $C_{t,i} - V_{t,i}$ .

To further enhance cost-efficiency, the framework incorporates a strategy refinement mechanism powered by a long-term memory bank $\mathcal{M}$ . After each auction, all submitted strategies—winning and losing—are stored alongside task outcomes. For a new task $t$ , if the provisional winner is not the cheapest agent, cheaper agents retrieve contrastive pairs of past winning and losing strategies from memory, conditioned on task similarity. These agents then generate refined strategies $s_{t,i}^r$ using a contrastive prompt template, which are re-evaluated for cost and value. If any refined bid improves the cost-value trade-off, it replaces the provisional winner; otherwise, the original selection is retained. This opportunistic refinement ensures that only cost-efficient agents incur the overhead of memory retrieval and re-planning, preserving the system’s overall efficiency.

The entire auction mechanism, including jury scoring and refinement, incurs minimal inference overhead—on the order of a few hundred tokens—compared to the tens of thousands or millions of tokens typically consumed during final execution, making the overhead negligible relative to total compute.

Experiment

Smaller, cheaper agents perform nearly as well as larger ones on simple tasks but fall sharply behind as task complexity increases, showing strong stratification by model size and cost.
Larger agents do not inherently solve complex tasks more token-efficiently; they often use similar or more tokens than smaller agents, failing to offset their higher per-token cost.
SALE, a strategy-based auction system with memory-driven self-refinement, consistently outperforms single agents and existing routers by dynamically assigning tasks to the most cost-effective agent while improving or matching accuracy.
SALE reduces cost by 17–53% across task bins while improving or maintaining pass@1, pushing the performance-cost Pareto frontier beyond any single model or baseline router.
Smaller agents (4B, 8B) handle a significant share of workload even on complex tasks, and their contribution grows over time as auction feedback enables progressive refinement of their strategies.
Shapley value analysis confirms that smaller agents contribute meaningfully to the ensemble through jury scoring and memory, even when rarely selected for final execution.
Qualitative analysis reveals complementary failure modes: larger agents often over-engineer or skip verification, while smaller agents rely more on tools and checks, enabling SALE to exploit strategic differences at bid time.
Ablations show that all components of SALE’s cost-value function and jury mechanism are essential; removing any degrades performance or efficiency, and jury diversity provides robust, low-overhead gains.
SALE’s routing is conservative, favoring accuracy over cost—often over-escalating to larger agents on easy tasks—but rarely under-escalates, preserving correctness where it matters most.

Removing any single judge from the jury in the SALE system reduces overall pass@1 performance, confirming that each agent contributes unique signal to the ensemble. The full jury consistently outperforms any single judge configuration, with smaller judges like the 4B model playing a critical role in improving both accuracy and cost efficiency, particularly in coding tasks. This diversity in judgment provides a regularizing effect that enhances robustness without adding significant computational overhead.

Ablating any component of the cost-value function—price, strategy length, entropy, or jury scoring—consistently reduces overall pass@1 performance, confirming that each term meaningfully contributes to the system’s effectiveness. While some ablations yield slightly lower costs in specific cases, these gains come at the expense of accuracy, particularly on more complex tasks. The full function strikes the optimal balance, enabling both higher performance and more efficient resource use across task types and complexity levels.

The oracle router, which selects the smallest agent capable of solving each task, achieves the highest possible pass@1 scores across all complexity levels for both deep search and coding tasks, while minimizing cost by defaulting to cheaper models when correctness is unattainable. Results show that performance declines sharply as task complexity increases, with the most complex tasks (τ(t) ≤ 60) yielding pass@1 scores as low as 25.0 for deep search and 34.2 for coding, indicating inherent difficulty rather than routing inefficiency. The consistent cost advantage of the oracle—especially on simpler tasks—highlights the potential for intelligent routing systems to approach this ideal by better matching agent capability to task demands.

Results show that smaller, cheaper agents perform nearly as well as larger ones on simple tasks but fall significantly behind as complexity increases. The SALE routing system consistently outperforms single agents and alternative routers by dynamically assigning tasks to the most cost-effective agent without sacrificing accuracy, achieving better performance-cost trade-offs across all complexity levels. This advantage stems from strategy-based bidding, jury evaluation, and memory-driven refinement, which collectively enable more efficient resource allocation and gradual improvement in smaller agents’ competitiveness over time.

The authors evaluate the impact of removing self-feedback or peer-feedback from the SALE routing system, finding that both feedback types contribute meaningfully to performance. Without peer-feedback, pass@1 drops sharply, especially on complex tasks, while removing self-feedback leads to lower costs but reduced accuracy, indicating that peer-judgment provides essential external calibration for harder problems. Overall, the system performs best when both feedback mechanisms are active, highlighting their complementary roles in balancing accuracy and efficiency.

소스 PDF

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

4달 전

Lisa Alazraki William F. Shen Yoram Bachrach Akhil Mathur

초록

One-sentence Summary

Key Contributions

Small agents perform nearly on par with large ones on simple tasks but degrade significantly as complexity increases in deep search and coding workloads, a scaling gap empirically quantified here for the first time using real-world tasks paired with human solution times via the new HST-BENCH benchmark.
The paper introduces SALE, a marketplace-inspired framework where heterogeneous agents bid with strategic plans scored by cost-value metrics and refined through shared auction memory, enabling dynamic task routing and test-time self-improvement without requiring separate training or full model execution.
SALE reduces reliance on the largest agent by 53% and cuts overall cost by 35% while matching or exceeding its performance on complex tasks, outperforming static routers that fail to adapt or reduce cost, demonstrating that auction-based coordination can effectively “scale up” small agents through adaptive orchestration.

Introduction

Dataset

The authors use HST-BENCH, a human-timed evaluation dataset of 753 agentic tasks, to measure performance across deep search and coding domains. Here’s how they construct and use it:

Dataset Composition and Sources
HST-BENCH combines existing open-source benchmarks: SimpleQA, PopQA, HotpotQA, GAIA, Humanity’s Last Exam (HLE), MBPP, and LeetCode. They supplement coding with a custom multiple-choice set (Coding-MCQ) to better populate low-complexity bins.
Key Details per Subset
- Tasks are sampled from official test splits; invalid or unanswerable instances are discarded.
- HLE is restricted to expert-validated chemistry/biology questions.
- GAIA’s human solution times are reused from its validation split.
- LeetCode “Hard” tasks use published timing estimates (Siroš et al., 2024) due to annotation cost.
- Coding-MCQ includes short, conceptual questions targeting core programming knowledge (examples in Appendix A.4).
Complexity Binning and Annotation
- Human solution time (τ(t)) is annotated by 3+ expert annotators per task (CS graduates), using permitted tools only (e.g., browser, IDE).
- Times are filtered for correctness and outliers (>2 SD from mean), then averaged.
- Tasks are grouped into 5 non-overlapping bins based on τ(t): 0–0.1 min, 0.1–0.5 min, 0.5–2.5 min, 2.5–12.5 min, 12.5–60 min.
- Bins follow a geometric progression (5× spacing) to balance sample sizes across 3 orders of magnitude in solution time.
- Inter-annotator reliability is high (Krippendorff’s α = 0.86).
Use in Model Evaluation
- The test split contains 753 tasks; separate dev sets (68 for search, 88 for coding) are used for tuning.
- Model performance is analyzed per complexity bin to study how scaling affects agentic capability.
- The authors use Qwen3 models (4B–32B) for controlled scaling experiments, isolating size effects while holding architecture and training constant.
- Evaluation metrics include success rate and cost (token-based), with future extensions planned for tool pricing.
Processing and Metadata
- Metadata includes source dataset, complexity bin, and aggregated human solution time.
- Table 3 (in paper) shows dataset contribution per bin — low bins dominated by SimpleQA/Coding-MCQ, high bins by GAIA, HLE, and LeetCode Hard.
- No cropping or data augmentation is applied; tasks are used in original form with standardized timing protocols.

Method

C_{t,i} = w_c \cdot \pi(a_i) \cdot |s_{t,i}|.

Value $V_{t,i}$ is computed as a weighted combination of intrinsic and extrinsic signals:

V_{t,i} = w_h \cdot H(s_{t,i}) + \sum_{a_j \in \mathcal{A}} w_j \cdot \gamma_j(s_{t,i}),

z_t = \sum_{a_i \in \mathcal{A}} x_{t,i} \left( C_{t,i} - V_{t,i} \right),

which reduces to selecting the agent with the lowest $C_{t,i} - V_{t,i}$ .

Experiment

Smaller, cheaper agents perform nearly as well as larger ones on simple tasks but fall sharply behind as task complexity increases, showing strong stratification by model size and cost.
Larger agents do not inherently solve complex tasks more token-efficiently; they often use similar or more tokens than smaller agents, failing to offset their higher per-token cost.
SALE, a strategy-based auction system with memory-driven self-refinement, consistently outperforms single agents and existing routers by dynamically assigning tasks to the most cost-effective agent while improving or matching accuracy.
SALE reduces cost by 17–53% across task bins while improving or maintaining pass@1, pushing the performance-cost Pareto frontier beyond any single model or baseline router.
Smaller agents (4B, 8B) handle a significant share of workload even on complex tasks, and their contribution grows over time as auction feedback enables progressive refinement of their strategies.
Shapley value analysis confirms that smaller agents contribute meaningfully to the ensemble through jury scoring and memory, even when rarely selected for final execution.
Qualitative analysis reveals complementary failure modes: larger agents often over-engineer or skip verification, while smaller agents rely more on tools and checks, enabling SALE to exploit strategic differences at bid time.
Ablations show that all components of SALE’s cost-value function and jury mechanism are essential; removing any degrades performance or efficiency, and jury diversity provides robust, low-overhead gains.
SALE’s routing is conservative, favoring accuracy over cost—often over-escalating to larger agents on easy tasks—but rarely under-escalates, preserving correctness where it matters most.

소스 PDF

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

Command Palette

작은 에이전트의 전략 경매를 통한 확장

Lisa Alazraki William F. Shen Yoram Bachrach Akhil Mathur

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

작은 에이전트의 전략 경매를 통한 확장

Lisa Alazraki William F. Shen Yoram Bachrach Akhil Mathur

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

작은 에이전트의 전략 경매를 통한 확장

Lisa Alazraki William F. Shen Yoram Bachrach Akhil Mathur

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters