WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces
Wanli Li Bowen Zhou Yunyao Yu Zhou Xu Yifan Yang Dongsheng Li Caihua Shan
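Abstract
Computer-use agents (CUAs) are increasingly deployed in runtimes that integrate visual desktop control, command-line execution, code editing, browsers, and external tools. Existing benchmarks, however, often evaluate these interfaces as separable capabilities, leaving long-horizon cross-interface orchestration under-tested. This work presents WeaveBench, a long-horizon hybrid-interface benchmark of 114 tasks across eight real-world work domains, grounded in real user requests and publicly verifiable artifacts. Each task requires agents to interleave GUI observation and action with CLI/code operations within a single trajectory. The tasks are evaluated on a real Ubuntu desktop inside deployed CLI-agent runtimes equipped with a minimal desktop-control plugin. The work also proposes a trajectory-aware judge that inspects deliverables, files, screenshots, logs, and action traces, and that detects shortcut behaviors such as fabricated visual evidence and hard-coded metric values. Across frontier model and runtime pairings, the best PassRate is only 41.2%, indicating that the benchmark is still far from saturated. Analysis with the trajectory-aware judge further shows that outcome-only grading substantially overestimates agent performance. Overall, WeaveBench highlights a key challenge in CUA evaluation and provides an effective testbed for measuring whether agents can orchestrate GUI, CLI, and code operations over long-horizon real-world work tasks.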
One-sentence Summary
WEAVEBENCH is a long-horizon benchmark of 114 tasks across eight real-world domains that evaluates computer-use agents on hybrid GUI, CLI, and code orchestration, and its trajectory-aware judge verifies multi-step execution and detects shortcut behaviors, revealing that outcome-only grading substantially overestimates agent performance.
Key Contributions
- WEAVEBENCH is introduced as a long-horizon hybrid-interface benchmark comprising 114 tasks across eight real-world domains that require agents to interleave graphical user interface actions with command-line and code operations within a single execution trajectory.
- A trajectory-aware agentic judge is developed to audit multi-turn agent behavior by autonomously re-fetching screenshots, logs, and file states to score process and outcome dimensions while actively detecting shortcut behaviors such as fabricated visuals or hard-coded metrics.
- Evaluations across deployed runtimes and frontier model pairings demonstrate that the benchmark remains unsaturated, with the highest PassRate reaching only 41.2% and trajectory-aware auditing correcting the substantial inflation caused by outcome-only grading.
Introduction
Modern computer-use agents increasingly integrate graphical desktop controls, command-line interfaces, and external tools to manage complex production workflows. This hybrid architecture matters because visual interfaces expose transient interactive states while code environments provide structured, persistent data, making true cross-interface coordination essential for real-world automation. Existing benchmarks, however, either evaluate only single-channel interactions or pose tasks that can be solved through one interface alone, so they fail to test genuine hybrid orchestration. To close this gap, the authors introduce WEAVEBENCH, a benchmark containing 114 real-world tasks that strictly require interleaving GUI observations with CLI or code execution. They deploy these tasks across live agent runtimes and pair them with a trajectory-aware evaluation system that audits multi-step processes rather than just final outputs. Using this framework, the authors demonstrate that current models still struggle with long-horizon cross-interface coordination, establishing WEAVEBENCH as a rigorous testbed for advancing hybrid computer-use agents.
Dataset
Dataset Composition and Sources
- The authors introduce WEAVEBENCH, a benchmark comprising 114 long-horizon tasks across 8 real-world work domains designed to evaluate agents operating on hybrid interfaces.
- Tasks are sourced from real user requests and publicly verifiable artifacts, with a release containing 174 provenance URLs spanning 82 unique hostnames.
- Sources include GitHub issues and pull requests, postmortems, design mocks, the OPENCLAW user community, Reddit, Stack Exchange, YouTube, project bug trackers, and official documentation.
- Approximately 80% of tasks link to at least one user-pain source where a real user reported a failure, while the remaining tasks rely on reference materials from project documentation or niche repositories.
Subset Details and Filtering Rules
- The dataset covers 8 domains: desktop productivity, document processing, games and interactive applications, web development, data analysis and visualization, DevOps and sysadmin, spatial and 3D/CAD, and design and creative.
- The tasks are organized into 23 subcategories, and each domain contains between 10 and 18 tasks; the floor of 10 tasks per domain ensures statistical resolution.
- Tasks must satisfy three admission criteria. First, channel non-substitutability requires that success depends on interleaving GUI observations and actions with CLI or code operations within a single trajectory.
- Second, long-horizon execution mandates multiple interleaved phases rather than isolated perception or tool-use steps.
- Third, cross-application state demands that agents preserve and transfer information across multiple independent applications.
- Construction follows a pipeline where experts define cooperation archetypes per domain, assemble self-contained bundles with environment seeds and verification anchors, conduct independent blind reviews, and run pilot validation with three agents to filter broken or trivial tasks.
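The three admission criteria can be approximated as a mechanical pre-filter over a reference trajectory. The sketch below is illustrative only and is not the authors' pipeline: the `Step` record, the channel labels `"gui"` and `"cli"` (with `"cli"` also covering code edits), and the thresholds `min_phases` and `min_apps` are all assumptions. In particular, the presence of both channels is only a proxy for non-substitutability, which in the benchmark is judged by experts and blind reviewers.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Step:
    channel: str  # hypothetical labels: "gui" or "cli" ("cli" also covers code edits)
    app: str      # application the step touches


def phases(steps):
    """Collapse consecutive steps on the same channel into one phase each."""
    return [channel for channel, _ in groupby(steps, key=lambda s: s.channel)]


def admits(steps, min_phases=3, min_apps=2):
    """Rough proxy for the three admission criteria (thresholds are assumptions)."""
    ph = phases(steps)
    uses_both_channels = {"gui", "cli"} <= set(ph)          # proxy for non-substitutability
    long_horizon = len(ph) >= min_phases                    # multiple interleaved phases
    cross_app = len({s.app for s in steps}) >= min_apps     # state spans several applications
    return uses_both_channels and long_horizon and cross_app
```

A trajectory that inspects a GUI, runs shell commands, and returns to the GUI passes this filter, while a shell-only trajectory is rejected regardless of its length.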
Usage and Processing
- The authors use the dataset exclusively for evaluation within deployed CLI-agent runtimes on a real Ubuntu desktop augmented with a minimal desktop-control plugin.
- Evaluation employs a trajectory-aware agentic judge that inspects deliverables, files, screenshots, logs, and action traces to compute scores based on bottom-up rubrics.
- Processing includes an inference-time anti-fabrication policy that explicitly prohibits generating fake GUI images via drawing libraries and permits agents to skip uncapturable screenshots with an honest fallback mechanism.
- The benchmark captures detailed trajectory statistics, including a median of 76 tool calls and 16 GUI-to-CLI channel switches per task, with maximum rollouts reaching 471 tool calls.
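Statistics of this kind can be derived from per-rollout tool-call logs. The following is a minimal sketch under stated assumptions: each rollout is represented as a list of channel labels, one per tool call, and a "switch" is counted whenever two consecutive calls use different channels. The source does not specify the exact counting rule or log format, so both are hypothetical.

```python
from statistics import median


def count_switches(channels):
    """Count positions where consecutive tool calls use different channels."""
    return sum(1 for prev, cur in zip(channels, channels[1:]) if prev != cur)


def summarize(rollouts):
    """Aggregate tool-call and channel-switch counts over a set of rollouts."""
    calls = [len(r) for r in rollouts]
    switches = [count_switches(r) for r in rollouts]
    return {
        "median_tool_calls": median(calls),
        "median_switches": median(switches),
        "max_tool_calls": max(calls),
    }
```

Counting transitions rather than raw GUI calls keeps the metric sensitive to interleaving: a rollout that performs all GUI work first and all CLI work afterwards registers a single switch, however long it runs.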
Metadata Construction
- Metadata is constructed through task bundles that attach provenance indices with URLs, commit hashes, and post identifiers to each task.
- Bundles include expert reference trajectories annotated with required single-channel atomic operations to audit channel usage.
- Verification anchors are embedded within the metadata to support the judge in validating deliverables and detecting shortcut behaviors such as fabricated visual evidence or hard-coded metrics.
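One way to picture a task bundle is as a small record type. The sketch below is an assumption about shape only: the class names, field names, and the `missing_ops` helper are hypothetical and do not reflect the released schema. It shows how required single-channel atomic operations from a reference trajectory could be compared against an observed trace to audit channel usage.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Provenance:
    url: str
    commit_hash: Optional[str] = None  # set when the source is a repository state
    post_id: Optional[str] = None      # set when the source is a forum or community post


@dataclass
class TaskBundle:
    task_id: str
    domain: str
    provenance: List[Provenance] = field(default_factory=list)
    # (channel, operation) pairs annotated on the expert reference trajectory
    required_ops: List[Tuple[str, str]] = field(default_factory=list)
    verification_anchors: List[str] = field(default_factory=list)

    def missing_ops(self, observed):
        """Return required operations that never appear in the observed trace."""
        seen = set(observed)
        return [op for op in self.required_ops if op not in seen]
```

A non-empty result from `missing_ops` would flag a run that produced its deliverable without performing a required step on one of the channels, which is the kind of case a trajectory-aware judge is meant to examine more closely.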
Experiment
The evaluation compares diverse model APIs and agent runtimes to identify the strongest pairings, while dedicated ablations test whether hybrid GUI-CLI access is strictly necessary and how much trajectory-aware judging matters. Results show that cooperative multi-channel execution is required for task completion: single-interface setups collapse to near-zero performance, unlike prior benchmarks where hybrid access merely offers convenience. Qualitative failure analysis reveals that breakdowns stem primarily from lapses in long-horizon planning discipline and from reward hacking rather than from visual perception, with distinct error patterns emerging consistently across model families. Overall, the work establishes that careful model-runtime alignment and rigorous trajectory auditing are essential for accurately measuring and advancing frontier agent capabilities.