HyperAIHyperAI

Command Palette

Search for a command to run...

エージェント・ハーネス工学:調査

概要

大規模言語モデル(LLM)エージェントの実環境(プロダクション)への急速な展開により、一貫した傾向が顕在化している。それは、タスク実行の信頼性は基盤となるモデルそのものよりも、それを包摂するインフラストラクチャ層、すなわち「エージェント実行ハルサー(エージェント実行ハーネス)」に大きく依存しているという点である。本調査論文は、実装の知見に基づいた体系的なアプローチにより、エージェントハルサー工学を、以下の3つの主要な主張を軸に整理・提示する。第一に、エージェントハルサーは独立したシステムレイヤーであり、その工学上の品質が実世界における信頼性の大部分を支配している。この立場を確立するため、本稿では、プロンプト、コンテキスト、ハルサー工学へと移行する3フェーズからなる工学進化論、コスト・品質・速度の三者悪(トライレマ)、能力と制御のトレードオフ、ハルサーの結合問題を含むクロスレイヤーの合成分析、さらに研究の空白地帯と実運用上の課題(パイポイント)に裏打ちされたオープンな課題アジェンダを展開する。第二に、本稿はETCLOVGという7層の分類体系(実行環境、ツールインターフェース、コンテキスト管理、ライフサイクル/オーケストレーション、観測性、検証、ガバナンス)を提案する。この分類体系は、先行する6コンポーネント框架を拡張するものであり、観測性(Observability)とガバナンス(Governance)を独立したアーキテクチャ上の懸念事項として位置づけている。

One-sentence Summary

This survey provides a practice-grounded, systematic treatment of agent harness engineering, arguing that the infrastructure layer drives LLM agent reliability more than the underlying model, and introduces ETCLOVG, a seven-layer taxonomy extending prior six-component frameworks by treating observability and governance as independent architectural concerns to address production pain points and the cost-quality-speed trilemma.

Key Contributions

  • The paper introduces ETCLOVG, a seven-layer taxonomy encompassing Execution environment, Tool interface, Context management, Lifecycle/Orchestration, Observability, Verification, and Governance. This framework extends prior six-component models by treating Observability and Governance as independent architectural concerns.
  • A mapping of over 148 open-source projects onto the taxonomy provides the most extensive ecosystem snapshot to date. This analysis surfaces adoption patterns, coverage gaps, and emerging design principles within the agent infrastructure landscape.
  • The work establishes the agent harness as an independent system layer driving real-world reliability through a three-phase engineering evolution from prompt to context to harness engineering. This synthesis covers the cost-quality-speed trilemma, capability-control tradeoff, and harness coupling problem to situate harness engineering within a broader trajectory.

Introduction

The rapid deployment of large language model agents in production reveals that task reliability depends less on the underlying model than on the infrastructure layer wrapping it. Prior research has focused heavily on model capabilities while practitioners lack the formal vocabulary to systematically improve the integrating system. The authors address this gap by advancing the binding-constraint thesis, which positions the agent harness as the primary driver of real-world reliability. They introduce ETCLOVG, a seven-layer taxonomy that treats observability and governance as independent architectural concerns rather than side effects. Additionally, the team maps over 140 open-source projects to this framework to identify ecosystem patterns and distill engineering principles from production deployments.

Dataset

  • Dataset Composition and Sources: The authors constructed a systematic corpus mapping publicly documented agent-harness artifacts from four streams: prior surveys, GitHub searches, curated lists, and company engineering blogs.
  • Key Subsets and Examples: The collection includes general-purpose sandboxes like Daytona and E2B, computer-use infrastructure such as Anthropic Computer Use, and browser environments like WebArena. Software engineering benchmarks including SWE-bench and Terminal-Bench are also mapped.
  • Usage and Analysis: The dataset serves as a map of the visible agent-harness ecosystem rather than a training split. The authors use it to assign artifacts to seven ETCLOVG layers based on public evidence.
  • Processing and Metadata: Projects were filtered to exclude simple chatbots and static datasets. Metadata such as project names and release years were recorded in a snapshot frozen on May 08, 2026. Coding followed a single-primary-coder protocol with author audit.

Method

The authors propose a seven-layer taxonomy for agent harness engineering, referred to by the acronym ETCLOVG, which stands for Execution, Tooling, Context, Lifecycle, Observability, Verification, and Governance. This framework distinguishes between the structural core of a harness and the control plane surrounding it. The first four layers describe the structural core. Execution (E) determines where agent code runs and what sandbox constraints bound it. Tooling (T) specifies how external capabilities are described, discovered, and invoked. Context (C) controls what the model can see over short-term, session-level, and persistent horizons. Lifecycle (L) organizes the control flow that reads and writes that state, ranging from single-agent loops to multi-agent workflows.

The remaining three layers describe the control plane. Observability (O) captures traces, costs, failures, and reliability signals. Verification (V) turns tasks and traces into evaluation, failure attribution, and regression feedback. Governance (G) constrains behavior through permission, identity, policy, hardening, audit, and human oversight mechanisms. Two design choices distinguish this taxonomy. First, Observability is promoted to an independent layer rather than treated as a side effect of lifecycle hooks. Second, Governance is introduced as a first-class layer that captures the full spectrum of security and compliance concerns.

Verification and evaluation is organized as a task-to-feedback lifecycle. This process begins with task and benchmark grounding, followed by pre-execution readiness validation. Controlled execution and trace capture run the agent under reproducible conditions. Multi-level judgement and failure attribution evaluate the run at outcome, trajectory, and evaluator levels. Finally, continuous regression and deployment feedback convert results into engineering evidence for harness improvement. This lifecycle reframes evaluation from a leaderboard mechanism into a quality-control loop for agent harnesses.

Governance is integrated through lifecycle hooks that define when policy checks fire. Many harnesses expose hook points at each stage of the agent loop. Pre-execution hooks validate input before it reaches the LLM. Pre-invocation hooks inspect the proposed action before tool execution. Post-execution hooks mediate information flow from tool output back into context. Human-in-the-loop hooks gate consequential actions on user approval. These hooks allow governance logic to be injected without modifying the agent's core reasoning.

Experiment

An aggregate analysis of over 170 projects indicates that execution and tooling infrastructure are mature, whereas governance and observability remain fragmented across open-source ecosystems. Memory system experiments validate various architectural strategies, including hybrid storage and collective learning, demonstrating a shift from research prototypes to production-ready infrastructure. Evaluation frameworks prioritize pre-execution readiness and trajectory-level analysis to ensure reproducibility and generate specific engineering feedback, while governance gaps highlight the need for standardized policies and unified adversarial benchmarks to support safe deployment.


AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助
すぐに使える GPU
最適な料金体系

HyperAI Newsletters

最新情報を購読する
北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします
メール配信サービスは MailChimp によって提供されています