HyperAIHyperAI

Command Palette

Search for a command to run...

ClawMark: 다중 턴, 다중 일일, 다중 모달 coworker agents를 위한 라이브 월드 벤치마크

초록

언어 모델 기반 에이전트(language-model agents)는 여러 근무일에 걸쳐 사용자를 보조하는 지속적인 동료(coworker)로서 점점 더 광범위하게 활용되고 있습니다. 이러한 업무 워크플로우 중 에이전트의 의도와 무관하게 주변 환경이 변화할 수 있습니다. 예를 들어 새 이메일이 도착하거나, 캘린더 항목이 변경되며, 지식베이스 기록이 업데이트되고, 이미지, 스캔된 PDF, 오디오, 비디오, 스프레드시트 등 다양한 미디어에 증거 자료가 추가됩니다. 기존 벤치마킹 평가는 주로 단일 정적 에피소드(static episode) 내에서 실행되고 텍스트 중심의 평가에 치우쳐 있어, 이러한 동적 환경을 adequately(충분히) 평가하지 못하는 한계가 있습니다.이에 본 논문에서는 다중 턴(multi-turn) 다일(day) 기반 태스크, 턴 간 상태가 진화하는 상태 유지형 샌드박스 서비스 환경(stateful sandboxed service environment), 그리고 규칙 기반 검증(rule-based verification)을 핵심으로 구성된 동료 에이전트(coworker agent)용 벤치마크인 [논문 제목 누락]을 제안합니다. 현재 릴리스판에는 13개의 전문 시나리오에 걸쳐 100개의 태스크가 포함되어 있으며, 파일 시스템, 이메일, 캘린더, 지식베이스, 스프레드시트 등 5가지 상태 유지형 샌드박스 서비스와 상호작용하도록 설계되었습니다. 평가 점수는 실행 후 서비스의 최종 상태를 바탕으로 1,537개의 결정론적 Python 검사기(deterministic Python checkers)를 통해 산출되며, 채점 과정에서 LLM-as-judge(대언어 모델 판사)는 일절 사용되지 않습니다.우리는 7개의 최첨단 에이전트 시스템(benchmark frontier agent systems)에 대해 평가를 수행했습니다. 가장 성능이 뛰어난 모델은 가중치 점수(weighted score) 75.8을 기록했으나, 엄격한 태스크 성공률(strict Task Success)은 20.0%에 불과했습니다. 이는 부분적인 진행은 흔히 발생하지만, 완전한 엔드투엔드(end-to-end) 워크플로우 완성은 여전히 드문 현상임을 시사합니다. 턴 수준(turn-level) 분석 결과, 첫 번째 외부 환경 변화(exogenous environment update) 이후 성능이 저하되는 양상을 보여, 변화하는 상태에 대한 적응(adaptation)이 핵심적인 미해결 과제임을 강조합니다. 본 연구는 재현 가능한 동료 에이전트(coworker-agent) 평가를 위해 벤치마크, 평가 해니스(evaluation harness), 그리고 구축 파이프라인(construction pipeline)을 공개합니다.

One-sentence Summary

The ClawMark Team introduces CLAWMARK, a benchmark of 100 tasks across 13 professional scenarios that evaluates seven frontier coworker agents on multi-turn, multi-day workflows within a stateful sandboxed environment, utilizing 1,537 deterministic Python checkers to score outcomes and reveal that even the strongest model achieves only 20.0% strict task success while struggling to adapt to exogenous environment updates.

Key Contributions

  • CLAWMARK is introduced as a benchmark for coworker agents that integrates multi-turn multi-day workflows with a stateful sandboxed environment where exogenous updates occur independently between agent turns.
  • The evaluation framework implements a deterministic scoring protocol using 1,537 Python checkers to verify post-execution service states, thereby eliminating reliance on LLM-as-judge assessments.
  • Benchmarking seven frontier agent systems reveals that while the highest weighted score reaches 75.8, strict task success remains at 20.0%, with performance degrading after initial environment updates to underscore adaptation as a primary challenge.

Introduction

As language models evolve from single-session tools into persistent digital coworkers, evaluating them in dynamic, real-world office environments has become essential. Current benchmarks struggle to capture this reality because they treat interactions as static snapshots, ignore external state changes that occur independently of the agent, and rely almost exclusively on text-based inputs. To address these gaps, the authors introduce CLAWMARK, a benchmark that simulates multi-day coworker workflows across stateful sandboxed services like email, calendars, and shared drives. The framework subjects agents to exogenous environment updates and untranscribed multimodal evidence while replacing subjective LLM judges with a deterministic, rule-based scoring protocol. By providing a reproducible evaluation harness and testing seven frontier models, the authors establish a rigorous standard for measuring persistent agent adaptation, external state tracking, and multimodal integration.

Dataset

1. Dataset composition and sources

  • The authors introduce CLAWMARK, a benchmark consisting of 100 tasks distributed across 13 professional scenarios and 87 distinct in-task roles.
  • The corpus relies on 1,072 raw multimodal artifacts, including photos, audio, scanned PDFs, video, and spreadsheets.
  • Evidence is sourced through three channels: web collection of public domain documents, original recordings, and targeted AI synthesis, all tagged with provenance information.

2. Key details for each subset

  • The dataset does not use traditional training or test splits. Instead, tasks are organized by professional domain, ranging from general office roles to specialized fields like clinical assistance and electronic design automation.
  • Each task spans two to six turns, with one turn representing a single in-universe working day.
  • Tasks contain between six and twenty-nine weighted evaluation checkers, including fifty-five strict red-line constraints that enforce hard rules.
  • Descriptive metadata tracks the mean share of silent versus loud between-turn mutations, though this classification is used solely for corpus characterization and does not influence scoring.

3. How the paper uses the data

  • The authors use the dataset exclusively for evaluation rather than model training, so no training splits or mixture ratios are applied.
  • Seven frontier agent systems are benchmarked using a single, framework-agnostic harness to isolate model performance differences.
  • Scoring is entirely rule-based. The authors employ 1,537 deterministic Python checkers that inspect the post-execution state of five sandboxed services including filesystem, email, calendar, knowledge base, and spreadsheet.
  • The evaluation pipeline guarantees deterministic results by requiring bit-identical checker verdicts across independent re-runs, explicitly avoiding LLM-as-judge methods.

4. Processing and metadata construction

  • No cropping or preprocessing is applied to the multimodal evidence; raw files are delivered directly to the agents for independent parsing.
  • Task specifications are constructed using a single task.py file per scenario, which defines turn structures, service seed hooks, between-turn injection scripts, and weighted checker rubrics.
  • The environment evolves independently between turns through two mutation types: loud events announced in wake-up prompts and silent mutations injected directly into services without notification.
  • A strict release gate filters the corpus during construction, requiring human sign-off, successful AI audits, and consistent re-run results. Tasks that fail deterministic verification more than twice are redesigned or removed.

Method

The authors leverage a modular framework designed to bridge natural-language task specifications with executable, multi-turn, multi-service agent evaluations. The core of the system revolves around a task definition file, task.py, which encapsulates all necessary components for a single evaluation episode. This file defines four key elements: turn entries, each an asynchronous function specifying the wake-up prompt, allowed tools, and service-side mutation hooks; inject layers, which provide evidence files introduced at the beginning of each turn; checker functions, deterministic Python routines that assess post-turn service state against rubric items; and a rubric mapping that assigns weights and turn assignments to each checker, with red-line checkers distinguished by ID convention and a fixed high weight.

The framework begins with a loader that parses the task definition file and associated staged input files into runtime objects. The task_loader.py module reads task.py, validates required fields, collects stage-specific evidence from the assets/ and inject/ directories, and builds the task object. This task object, defined in models.py, contains the prompt, stage functions, rubric, and environment configuration. A shared context, managed by context.py, holds the service manager and current snapshots of the backend services. This structured representation allows the orchestrator to execute each turn within an isolated docker-compose stack, which includes the agent container and five services: a Docker-mounted filesystem, GreenMail SMTP/IMAP, Notion API, Google Sheets API, and a Radicale CalDAV server.

The execution process proceeds in stages. The benchmark entry point, main.py, parses the task folder, boots the runtime, and builds the context. The orchestrator then executes each stage, starting with a seed or mutate state, followed by the agent taking actions, downloading outputs, and capturing a snapshot. After each turn, the checker system, implemented in orchestrator.py, runs all relevant checkers against the current service state. The checker module evaluates each rubric item, returning a pass/fail outcome. The results are aggregated into a final TaskResult record, which includes the per-checker outcomes and the final score, ensuring a comprehensive and deterministic evaluation of the agent's performance across all turns.

Experiment

The evaluation assesses seven frontier models across 100 tasks in 13 professional scenarios using a deterministic, sandboxed framework to measure end-to-end agent performance. The main benchmark validates general capability and domain specialization, revealing that top systems remain closely clustered with no single model dominating all tasks. Turn-by-turn trajectory analysis examines adaptive reasoning under dynamic environment changes, demonstrating that most agents struggle to recover after external state mutations, while a failure-mode taxonomy isolates structural weaknesses by showing that silent-change detection and backend writeback errors dominate overall mistakes. Collectively, these experiments indicate that reliable professional deployment requires robust state synchronization and strict compliance adherence, as aggregate scores often mask critical trajectory and constraint violations.

The experiment evaluates multiple models across a diverse set of professional scenarios, with results showing significant variation in performance across different domains. The benchmark highlights that no single model dominates all scenarios, and that models exhibit distinct strengths and weaknesses in handling complex, multi-turn tasks involving external state changes and compliance constraints. Performance is further influenced by the model's ability to recover after environmental shifts and avoid critical failure modes such as missing silent updates or failing to commit backend actions. Performance varies significantly across scenarios, with models showing specialization rather than a uniform top ranking. The majority of failures occur in detecting silent changes and committing backend actions, indicating key areas for improvement. Models struggle to recover after the first exogenous environment change, with most remaining below their initial performance level by the final turn.

The authors evaluate seven frontier models on a benchmark that measures coworker-agent behavior across multi-turn, multi-day workflows, focusing on responses to exogenous environment changes and compliance with safety constraints. Results show that no model achieves high performance across all metrics, with significant drops in performance after the first external mutation and a high failure rate in detecting silent changes and committing backend actions. The top models differ in strengths across scenarios, and performance is not consistently correlated with resource usage or tool efficiency. No model achieves high task success or full compliance, with the highest score falling short of 76 and the best task success at 20% Models show significant performance drops after the first exogenous environment change, with most failing to recover to initial levels Failure rates are highest for detecting silent changes and committing backend actions, indicating key weaknesses in state management and action execution

The authors evaluate seven frontier models on a benchmark covering specialized professional scenarios, with results showing that no model achieves high overall performance across all tasks. The top models cluster closely in overall score, but their performance varies significantly by scenario, with no single model excelling across all domains. The evaluation highlights that even the best models struggle with end-to-end task completion, and performance drops sharply after exogenous environmental changes, indicating limited recovery capabilities. No single model dominates all scenarios, with top performers varying by task type and showing no consistent leadership across domains. Performance declines significantly after the first environmental change, with most models failing to recover fully by the final day of the multi-turn evaluation. The two highest-scoring models are also the most efficient in terms of action cost, suggesting a positive correlation between overall performance and computational efficiency.

The authors evaluate seven frontier models on the ClawMark benchmark, which measures coworker-agent behavior across multi-turn, multi-day workflows with exogenous environment changes and multimodal evidence. Results show that no model achieves high task success, with the top model solving only about 20% of tasks fully, and performance varies significantly across scenarios, indicating specialization and room for improvement in handling structural failures like silent-change detection and backend writeback. No model achieves high task success, with the top performer solving only a small fraction of tasks fully. Performance varies across scenarios, with different models excelling in different domains, indicating specialization. Failures concentrate on structural challenges like silent-change detection and backend writeback, which are critical for real-world trustworthiness.

The authors analyze failure modes in a benchmark evaluating coworker-agent behavior, focusing on how models handle exogenous state changes and compliance constraints. The results show that failures are concentrated in two primary categories: silent-change detection and backend writeback, which have significantly higher failure rates than other evaluation types. Scenario-specific checks account for the majority of failures due to their high number rather than increased brittleness, while red-line violations, though rare, are concentrated in a small set of tasks and models. Silent-change detection and backend writeback are the dominant failure modes, with failure rates nearly double the benchmark-wide average. Scenario-specific checks contribute the largest share of failures, driven by their high number rather than higher per-evaluation brittleness. Red-line violations are rare overall but concentrated in specific tasks and models, indicating potential safety risks in otherwise high-performing trajectories.

The evaluation assesses multiple frontier models on a benchmark designed to validate coworker-agent capabilities across multi-turn, multi-day professional workflows that incorporate exogenous environmental shifts and compliance constraints. Results indicate that no single model demonstrates consistent superiority across all domains, revealing clear performance specialization rather than uniform capability. Models experience significant performance degradation following initial environmental changes and generally fail to recover to baseline levels throughout extended interactions. The primary weaknesses concentrate on detecting silent state updates and executing backend actions, highlighting critical gaps in state management and long-term reliability that limit overall task success.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp
ClawMark: 다중 턴, 다중 일일, 다중 모달 coworker agents를 위한 라이브 월드 벤치마크 | 문서 | HyperAI초신경