HyperAIHyperAI

Command Palette

Search for a command to run...

한 달 전
에이전트
LLM

AgentSPEX: An Agent SPecification and EXecution Language

Pengcheng Wang Jerry Huang Jiarui Yao Rui Pan Peizhi Niu Yaowenqi Liu Ruida Wang Renhao Lu Yuwei Guo Tong Zhang

초록

언어 모델 agent 시스템은 일반적으로 반응형 프롬프팅(reactive prompting)에 의존합니다. 이는 단일 지침이 모델을 개방형 추론 및 도구 사용 시퀀스로 유도하는 방식인데, 이 과정에서 제어 흐름(control flow)과 중간 상태(intermediate state)가 암시적으로 처리되어 agent의 동작을 제어하기가 잠재적으로 어려워질 수 있습니다. LangGraph, DSPy, CrewAI와 같은 오케스트레이션 프레임워크는 명시적인 워크플로우 정의를 통해 더 높은 구조를 부여하지만, 워크플로우 로직이 Python과 밀접하게 결합되어 있어 agent를 유지 관리하고 수정하기 어렵다는 단점이 있습니다.본 논문에서는 명시적인 제어 흐름과 모듈형 구조를 갖춘 LLM-agent 워크플로우를 지정하기 위한 Agent SPecification and EXecution Language인 AgentSPEX와 맞춤형 agent harness를 소개합니다. AgentSPEX는 타입이 지정된 단계(typed steps), 분기 및 루프, 병렬 실행, 재사용 가능한 서브모듈, 그리고 명시적 상태 관리를 지원합니다. 이러한 워크플로우는 도구 접근, 샌드박스 가상 환경, 그리고 체크포인팅(checkpointing), 검증 및 로깅을 지원하는 agent harness 내에서 실행됩니다. 또한, 워크플로우 작성 및 검사를 위해 그래프와 워크플로우 뷰가 동기화된 시각적 에디터를 제공합니다.본 연구에서는 심층 연구(deep research) 및 과학 연구를 위한 즉시 사용 가능한 agent들을 포함하였으며, 7개의 벤치마크를 통해 AgentSPEX를 평가했습니다. 마지막으로 사용자 연구를 통해 AgentSPEX가 기존의 대중적인 agent 프레임워크보다 더욱 해석 가능하고 접근성이 높은 워크플로우 작성 패러다임을 제공함을 입증합니다.

One-sentence Summary

To address the limitations of reactive prompting and tightly coupled Python orchestration frameworks, the paper introduce AgentSPEX, an agent specification and execution language that enables controllable and maintainable LLM-agent workflows through explicit control flow, modular structures, and a customizable harness featuring typed steps, state management, and a sandboxed environment, which the paper evaluate across seven benchmarks using ready-to-use agents for deep and scientific research.

Key Contributions

  • The paper introduces AgentSPEX, an agent specification and execution language that uses declarative YAML files to define LLM-agent workflows with explicit control flow, branching, loops, and modular submodules.
  • A customizable agent harness is provided to execute these workflows, offering a sandboxed virtual environment, tool access, and advanced capabilities such as state checkpointing, trajectory logging, and replay functionality.
  • The work includes a bidirectional visual editor for drag-and-drop workflow construction and demonstrates the framework's utility through ready-to-use research agents and evaluations across seven benchmarks in science, writing, and software engineering.

Introduction

Modern language model agent systems often rely on reactive prompting, where a single instruction guides an open ended sequence of reasoning. While simple, this approach lacks precise control over intermediate states and can struggle with long horizon tasks. Existing orchestration frameworks attempt to provide structure but typically couple workflow logic tightly with Python code, which creates steep learning curves and makes agents difficult to maintain or share with non-programmers. The authors leverage a new approach called AgentSPEX, an agent specification and execution language that uses declarative YAML syntax to define workflows. This contribution provides explicit control over branching, loops, and state management while offering a customizable execution harness and a visual editor to make agent authoring more accessible and interpretable.

Dataset

The authors utilize several specialized datasets to evaluate scientific reasoning and generative writing capabilities:

  • Science Datasets:
    • ChemBench: A subset of 90 question-answer pairs is created by randomly sampling 10 questions from each of the 9 domains within the original framework.
    • SciBench: The authors use a collection of 213 collegiate-level chemistry problems sourced from four specific textbooks: Atkins' Physical Chemistry (101 problems), Chemistry by McMurry & Fay (33 problems), Properties of Matter (47 problems), and Quantum Chemistry (32 problems).
    • MMLU-Pro Stemez: The Physical Chemistry subset is used, consisting of 216 problems sourced from the Stemez website.
  • Generative Writing Dataset:
    • WritingBench: To evaluate writing, the authors construct a 120-question subset by randomly sampling 20 questions from each of the 6 major domains.
  • Evaluation and Processing:
    • For the science benchmarks, the authors employ exact match evaluation against reference solutions.
    • For the writing benchmark, the authors follow an LLM-as-Judge protocol where a critic model assesses responses based on 5 instance-specific criteria using a 10-point scale.

Method

The core architecture of AgentSPEX is structured around a declarative workflow specification language and an agent harness that executes these workflows in a controlled, observable, and durable environment. The system begins with a YAML-based workflow definition that specifies an agent's name, goal, configuration, and a sequence of operations. This workflow defines a structured execution plan using primitives such as task and step, which determine how the agent interacts with tools and maintains conversation history. A task initiates a fresh interaction with no prior context, while a step allows for multi-turn reasoning by preserving conversation history across multiple interactions. These operations are supported by control flow constructs, including if, while, for_each, and call, enabling complex logic and modular composition. The call construct allows workflows to invoke other workflows as reusable submodules, promoting modularity and code reuse. The framework supports state management through context variables that are passed between steps using Mustache-style templates, such as {{variable}}, and can be saved using the save_as directive. This enables the seamless flow of intermediate results across the workflow.

The workflow definition is designed to be human-readable and editable, allowing domain experts to author and modify workflows without requiring programming expertise. Workflows are self-contained YAML files, facilitating version control and collaboration. To assist in workflow development, AgentSPEX provides a visual editor that renders workflows as interactive flowcharts, where each node corresponds to an operation. Users can modify the workflow graphically or directly in the synchronized YAML editor, with changes immediately reflected in both views. This visual interface supports rapid prototyping and debugging, enhancing the development experience. The visual editor is demonstrated in Figure 3, which illustrates a deep research agent workflow with reusable submodules.

The execution of these workflows is managed by an agent harness that orchestrates the entire process. The harness includes an interpreter that validates the workflow structure, resolves configuration parameters, and expands template variables. It then dispatches each operation to the appropriate handler, managing recursion for nested constructs such as loops and conditionals. Each operation is assigned a hierarchical step identifier for tracking and checkpointing. The executor implements the interaction loop between the language model and external tools, managing multi-turn conversations until the model returns a response with no further tool calls or reaches a configured limit. Execution occurs within a Docker-based sandbox environment, providing isolation and access to a comprehensive suite of tools, including web search, file operations, and browser automation. This sandbox is equipped with a browser, file system access, and a Model Context Protocol (MCP) client for tool execution.

The agent harness also features a robust observability dashboard that provides real-time monitoring and debugging capabilities. This dashboard logs agent actions and intermediate reasoning steps, allowing developers to inspect behavior at each stage of execution. For long-running workflows, the system ensures durability through checkpointing and execution tracing. Checkpoints are saved after each step, capturing the current context, variables, and sandbox state, enabling the workflow to resume from any point in case of interruption. Execution tracing records a full history of model responses, tool calls, and conversation states. The system supports selective trace replay, allowing developers to isolate the impact of changes to specific steps without re-executing the entire workflow. Furthermore, the declarative nature of AgentSPEX enables formal verification of workflows. By making control flow, variable dependencies, and step boundaries explicit in the YAML specification, the framework supports static and runtime verification of both structural and semantic correctness. This is demonstrated through the verification of the extract_single_citation_module, where pre- and post-conditions for each step are defined and verified using formal methods. The verification process checks that variables satisfy expected properties at each node, ensuring the workflow executes correctly and reliably. This capability provides a pathway for regulating agentic behavior and ensuring the reliability of complex autonomous systems.

Experiment

AgentSPEX was evaluated across seven diverse benchmarks in domains such as science, mathematics, writing, and software engineering to compare its performance against chain-of-thought and ReAct baselines. The results demonstrate that AgentSPEX consistently outperforms existing frameworks, particularly on tasks requiring complex reasoning or the processing of extensive documents where enforced step-by-step execution and explicit context management are beneficial. Additionally, a user study revealed that while developers find AgentSPEX more readable and accessible for creating new workflows, they perceive existing tools like LangGraph as more suitable for highly complex constructions.

The authors evaluate AgentSPEX on a benchmark comparing its performance against other agent frameworks across different models. Results show that AgentSPEX maintains consistent performance across model versions, with minimal degradation when upgrading from one model to another, while other frameworks exhibit more significant drops in performance. AgentSPEX shows minimal performance drop across model versions compared to other frameworks. AgentSPEX maintains stable results when switching between model versions. Other frameworks exhibit larger performance declines when transitioning between model versions.

The authors compare AgentSPEX with several existing agent frameworks across three key capabilities: natural language specification, explicit context management, and visual editing. Results show that AgentSPEX supports natural language and explicit context, while also offering a visual editor, positioning it as a comprehensive framework. Other frameworks either lack one or more of these features or only partially support them. AgentSPEX supports natural language specification, explicit context management, and a visual editor, unlike most other frameworks. Several frameworks, including LangGraph and n8n, support a visual editor but lack full support for natural language or explicit context. AgentSPEX is positioned as a more comprehensive framework by combining all three capabilities, while others offer partial or limited support.

The authors evaluate AgentSPEX across multiple benchmarks in science, mathematics, writing, paper understanding, and software engineering, comparing it against baselines such as CoT and ReAct. Results show that AgentSPEX consistently outperforms the baselines across all domains, with notable improvements on tasks requiring extended reasoning or complex input processing. The framework also demonstrates strong performance on software engineering benchmarks, achieving competitive results against specialized agents. A user study indicates that participants find AgentSPEX more interpretable and easier to use, especially for non-coders, though they perceive LangGraph as more suitable for complex workflows. AgentSPEX achieves the highest performance across all evaluated benchmarks compared to CoT and ReAct baselines. The framework shows significant improvements on tasks involving extensive input or multi-step reasoning, such as paper understanding and chemistry problems. Users rate AgentSPEX as more interpretable and accessible, particularly for those without coding expertise.

The authors evaluate AgentSPEX across seven benchmarks spanning science, mathematics, writing, paper understanding, and software engineering domains, comparing it against baselines such as chain-of-thought and ReAct. Results show that AgentSPEX outperforms all compared approaches on every benchmark, with particularly strong improvements on tasks requiring extended reasoning or complex input processing. AgentSPEX achieves the highest performance on all seven evaluated benchmarks across diverse domains. The framework shows significant improvements over baselines on benchmarks involving long-form reasoning and multi-step problem-solving. AgentSPEX outperforms both chain-of-thought and ReAct baselines, with larger gains observed on tasks requiring extensive context management.

The authors evaluate AgentSPEX through benchmark comparisons against existing frameworks and reasoning baselines, alongside a user study to assess usability. The results demonstrate that AgentSPEX offers superior stability across different model versions and provides a more comprehensive feature set by integrating natural language specification, explicit context management, and visual editing. Furthermore, the framework consistently outperforms standard reasoning approaches across diverse domains, particularly in tasks requiring complex, multi-step reasoning, while remaining highly interpretable and accessible to non-coders.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp
AgentSPEX: An Agent SPecification and EXecution Language | 문서 | HyperAI초신경