HyperAIHyperAI

Command Palette

Search for a command to run...

منذ 14 ساعات
إيجرنت
LLM

هندسة حشوات الوكلاء (Agent Harness Engineering): مراجعة شاملة

الملخص

كشف النشر السريع لوكلاء النماذج اللغوية الكبيرة (LLM) في البيئات الإنتاجية نمطاً متكرراً: إن موثوقية تنفيذ المهام لا تعتمد بشكل أساسي على النموذج الأساسي بقدر ما تعتمد على طبقة البنية التحتية التي تحيط به، وهي إطار تشغيل الوكيل (agent execution harness). تقدّم هذه المراجعة معالجة منهجية مستندة إلى الممارسات العملية لهندسة إطار تشغيل الوكلاء، منظّمة حول ثلاث فرضيات رئيسية.أولاً، يُشكّل إطار تشغيل الوكيل طبقة نظامية مستقلة، وجودة هندسته تدفع حيزاً كبيراً من الموثوقية في العالم الحقيقي. ونؤيد هذا الموقف من خلال تتبع التطور الهندسي الثلاثي المراحل، بدءاً من هندسة المطالبات (prompt) وصولاً إلى إدارة السياق (context) ثم هندسة إطار التشغيل (harness). ويتضمن ذلك تركيباً عابراً للطبقات يغطي المعضلة الثلاثية للتكلفة والجودة والسرعة، ومفاضلة القدرة مقابل التحكم، ومشكلة ارتباط إطار التشغيل. كما نعقد أجندة للمسائل المفتوحة المستندة إلى كل من الفجوات البحثية ونقاط الألم في البيئات الإنتاجية.ثانياً، نقترح تصنيفاً هرمياً ذو سبع طبقات يُعرف بـ ETCLOVG، والذي يتضمن: بيئة التنفيذ، وواجهة الأدوات، وإدارة السياق، ودورة الحياة/التنسيق، والقابلية للمراقبة، وال التحقق، والحوكمة. ويوسّع هذا التصنيف الأطر السابقة المكوّنة من ستة مكونات من خلال اعتبار القابلية للمراقبة والحوكمة كمهمتين معماريتين مستقلتين.

One-sentence Summary

This survey provides a practice-grounded, systematic treatment of agent harness engineering, arguing that the infrastructure layer drives LLM agent reliability more than the underlying model, and introduces ETCLOVG, a seven-layer taxonomy extending prior six-component frameworks by treating observability and governance as independent architectural concerns to address production pain points and the cost-quality-speed trilemma.

Key Contributions

  • The paper introduces ETCLOVG, a seven-layer taxonomy encompassing Execution environment, Tool interface, Context management, Lifecycle/Orchestration, Observability, Verification, and Governance. This framework extends prior six-component models by treating Observability and Governance as independent architectural concerns.
  • A mapping of over 148 open-source projects onto the taxonomy provides the most extensive ecosystem snapshot to date. This analysis surfaces adoption patterns, coverage gaps, and emerging design principles within the agent infrastructure landscape.
  • The work establishes the agent harness as an independent system layer driving real-world reliability through a three-phase engineering evolution from prompt to context to harness engineering. This synthesis covers the cost-quality-speed trilemma, capability-control tradeoff, and harness coupling problem to situate harness engineering within a broader trajectory.

Introduction

The rapid deployment of large language model agents in production reveals that task reliability depends less on the underlying model than on the infrastructure layer wrapping it. Prior research has focused heavily on model capabilities while practitioners lack the formal vocabulary to systematically improve the integrating system. The authors address this gap by advancing the binding-constraint thesis, which positions the agent harness as the primary driver of real-world reliability. They introduce ETCLOVG, a seven-layer taxonomy that treats observability and governance as independent architectural concerns rather than side effects. Additionally, the team maps over 140 open-source projects to this framework to identify ecosystem patterns and distill engineering principles from production deployments.

Dataset

  • Dataset Composition and Sources: The authors constructed a systematic corpus mapping publicly documented agent-harness artifacts from four streams: prior surveys, GitHub searches, curated lists, and company engineering blogs.
  • Key Subsets and Examples: The collection includes general-purpose sandboxes like Daytona and E2B, computer-use infrastructure such as Anthropic Computer Use, and browser environments like WebArena. Software engineering benchmarks including SWE-bench and Terminal-Bench are also mapped.
  • Usage and Analysis: The dataset serves as a map of the visible agent-harness ecosystem rather than a training split. The authors use it to assign artifacts to seven ETCLOVG layers based on public evidence.
  • Processing and Metadata: Projects were filtered to exclude simple chatbots and static datasets. Metadata such as project names and release years were recorded in a snapshot frozen on May 08, 2026. Coding followed a single-primary-coder protocol with author audit.

Method

The authors propose a seven-layer taxonomy for agent harness engineering, referred to by the acronym ETCLOVG, which stands for Execution, Tooling, Context, Lifecycle, Observability, Verification, and Governance. This framework distinguishes between the structural core of a harness and the control plane surrounding it. The first four layers describe the structural core. Execution (E) determines where agent code runs and what sandbox constraints bound it. Tooling (T) specifies how external capabilities are described, discovered, and invoked. Context (C) controls what the model can see over short-term, session-level, and persistent horizons. Lifecycle (L) organizes the control flow that reads and writes that state, ranging from single-agent loops to multi-agent workflows.

The remaining three layers describe the control plane. Observability (O) captures traces, costs, failures, and reliability signals. Verification (V) turns tasks and traces into evaluation, failure attribution, and regression feedback. Governance (G) constrains behavior through permission, identity, policy, hardening, audit, and human oversight mechanisms. Two design choices distinguish this taxonomy. First, Observability is promoted to an independent layer rather than treated as a side effect of lifecycle hooks. Second, Governance is introduced as a first-class layer that captures the full spectrum of security and compliance concerns.

Verification and evaluation is organized as a task-to-feedback lifecycle. This process begins with task and benchmark grounding, followed by pre-execution readiness validation. Controlled execution and trace capture run the agent under reproducible conditions. Multi-level judgement and failure attribution evaluate the run at outcome, trajectory, and evaluator levels. Finally, continuous regression and deployment feedback convert results into engineering evidence for harness improvement. This lifecycle reframes evaluation from a leaderboard mechanism into a quality-control loop for agent harnesses.

Governance is integrated through lifecycle hooks that define when policy checks fire. Many harnesses expose hook points at each stage of the agent loop. Pre-execution hooks validate input before it reaches the LLM. Pre-invocation hooks inspect the proposed action before tool execution. Post-execution hooks mediate information flow from tool output back into context. Human-in-the-loop hooks gate consequential actions on user approval. These hooks allow governance logic to be injected without modifying the agent's core reasoning.

Experiment

An aggregate analysis of over 170 projects indicates that execution and tooling infrastructure are mature, whereas governance and observability remain fragmented across open-source ecosystems. Memory system experiments validate various architectural strategies, including hybrid storage and collective learning, demonstrating a shift from research prototypes to production-ready infrastructure. Evaluation frameworks prioritize pre-execution readiness and trajectory-level analysis to ensure reproducibility and generate specific engineering feedback, while governance gaps highlight the need for standardized policies and unified adversarial benchmarks to support safe deployment.


بناء الذكاء الاصطناعي بالذكاء الاصطناعي

من الفكرة إلى الإطلاق — سرّع تطوير الذكاء الاصطناعي الخاص بك مع المساعدة البرمجية المجانية بالذكاء الاصطناعي، وبيئة جاهزة للاستخدام، وأفضل أسعار لوحدات معالجة الرسومات.

البرمجة التعاونية باستخدام الذكاء الاصطناعي
وحدات GPU جاهزة للعمل
أفضل الأسعار

HyperAI Newsletters

اشترك في آخر تحديثاتنا
سنرسل لك أحدث التحديثات الأسبوعية إلى بريدك الإلكتروني في الساعة التاسعة من صباح كل يوم اثنين
مدعوم بواسطة MailChimp