WeaveBench: A Long-Horizon, Realistic Benchmark for Computer-Use Agents with Hybrid Interfaces
Wanli Li Bowen Zhou Yunyao Yu Zhou Xu Yifan Yang Dongsheng Li Caihua Shan
Abstract
Computer-use agents (CUAs) increasingly operate within runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing benchmarks, however, mostly evaluate these interfaces as separable capabilities, which leaves long-horizon cross-interface orchestration under-tested. We therefore introduce WeaveBench, a long-horizon hybrid-interface benchmark of 114 tasks across 8 real-world work domains, grounded in real user requests and publicly verifiable artifacts. Each task requires agents to combine GUI observations and actions with CLI operations and code editing within a single execution trajectory. We evaluate these tasks on a real Ubuntu desktop inside deployed CLI-agent runtimes, augmented with a minimal desktop-control plugin. We also propose a companion trajectory-aware judge that inspects deliverables, files, screenshots, system logs, and action traces, and that detects shortcut behaviors such as fabricated visual evidence or hard-coded metrics. Across pairings of frontier models and runtimes, the best PassRate reaches only 41.2%, indicating that the benchmark is far from saturated. The trajectory-aware judge also reveals that outcome-only grading substantially overestimates agent performance. Overall, WeaveBench exposes a critical gap in CUA evaluation and provides an effective testbed for measuring how well agents coordinate GUI, CLI, and code operations across realistic long-horizon tasks.
One-sentence Summary
WEAVEBENCH is a long-horizon benchmark of 114 tasks across eight real-world domains that, unlike prior evaluations of isolated interfaces, tests computer-use agents on hybrid GUI, CLI, and code orchestration, and its trajectory-aware judge verifies multi-step execution and detects shortcut behaviors, showing that outcome-only grading substantially overestimates performance.
Key Contributions
- WEAVEBENCH is introduced as a long-horizon hybrid-interface benchmark comprising 114 tasks across eight real-world domains that require agents to interleave graphical user interface actions with command-line and code operations within a single execution trajectory.
- A trajectory-aware agentic judge is developed to audit multi-turn agent behavior by autonomously re-fetching screenshots, logs, and file states to score process and outcome dimensions while actively detecting shortcut behaviors such as fabricated visuals or hard-coded metrics.
- Evaluations across deployed runtimes and frontier model pairings demonstrate that the benchmark remains unsaturated, with the highest PassRate reaching only 41.2% and trajectory-aware auditing correcting the substantial inflation caused by outcome-only grading.
Introduction
Modern computer-use agents increasingly integrate graphical desktop controls, command-line interfaces, and external tools to manage complex production workflows. This hybrid architecture matters because visual interfaces expose transient interactive states while code environments provide structured, persistent data, making true cross-interface coordination essential for real-world automation. Existing benchmarks, however, either evaluate single-channel interactions or pose tasks that can be solved through one interface alone, so they do not test genuine hybrid orchestration. To close this gap, the authors introduce WEAVEBENCH, a benchmark containing 114 real-world tasks that strictly require interleaving GUI observations with CLI or code execution. They deploy these tasks across live agent runtimes and pair them with a trajectory-aware evaluation system that audits multi-step processes rather than just final outputs. Using this framework, the authors show that current models still struggle with long-horizon cross-interface coordination, which positions WEAVEBENCH as a rigorous testbed for advancing hybrid computer-use agents.
Dataset
Dataset Composition and Sources
- The authors introduce WEAVEBENCH, a benchmark comprising 114 long-horizon tasks across 8 real-world work domains designed to evaluate agents operating on hybrid interfaces.
- Tasks are sourced from real user requests and publicly verifiable artifacts, with a release containing 174 provenance URLs spanning 82 unique hostnames.
- Sources include GitHub issues and pull requests, postmortems, design mocks, the OPENCLAW user community, Reddit, Stack Exchange, YouTube, project bug trackers, and official documentation.
- Approximately 80% of tasks link to at least one user-pain source where a real user reported a failure, while the remaining tasks rely on reference materials from project documentation or niche repositories.
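The provenance figures above (URLs versus unique hostnames) are the kind of statistic that can be recomputed from a list of source links. The following is a minimal sketch of such a count; the function name `unique_hostnames` and the example URLs are placeholders for illustration and are not taken from the benchmark release.

```python
from urllib.parse import urlparse


def unique_hostnames(urls):
    """Return the set of hostnames found in a list of URLs."""
    hosts = set()
    for url in urls:
        # .hostname is lowercased by urlparse and is None when no host is present
        host = urlparse(url).hostname
        if host:
            hosts.add(host)
    return hosts


# Placeholder URLs for illustration only; they are not WeaveBench sources.
example_urls = [
    "https://github.com/example/project/issues/1",
    "https://GitHub.com/example/project/pull/2",
    "https://www.reddit.com/r/example/comments/abc",
    "https://docs.example.org/guide",
]

hosts = unique_hostnames(example_urls)
print(len(example_urls), "URLs across", len(hosts), "unique hostnames")
```

Note that this sketch treats `www.reddit.com` and `reddit.com` as different hostnames; whether the authors normalize such variants is not stated.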
Subset Details and Filtering Rules
- The dataset covers 8 domains: desktop productivity, document processing, games and interactive applications, web development, data analysis and visualization, DevOps and sysadmin, spatial and 3D/CAD, and design and creative.
- Each domain contains between 10 and 18 tasks, organized into 23 subcategories; the floor of 10 tasks per domain is intended to ensure statistical resolution.
- Tasks must satisfy three admission criteria. First, channel non-substitutability requires that success depends on interleaving GUI observations and actions with CLI or code operations within a single trajectory.
- Second, long-horizon execution mandates multiple interleaved phases rather than isolated perception or tool-use steps.
- Third, cross-application state demands that agents preserve and transfer information across multiple independent applications.
- Construction follows a pipeline where experts define cooperation archetypes per domain, assemble self-contained bundles with environment seeds and verification anchors, conduct independent blind reviews, and run pilot validation with three agents to filter broken or trivial tasks.
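The three admission criteria can be approximated as checks over an expert reference trajectory. The sketch below is only a rough structural proxy: the `Step` type, the `admits` function, and the thresholds `min_phases` and `min_apps` are assumptions made for illustration, and true channel non-substitutability is a judgment the authors assign to expert review, not something a syntactic check can establish.

```python
from dataclasses import dataclass

GUI, CLI, CODE = "gui", "cli", "code"


@dataclass
class Step:
    channel: str  # one of GUI, CLI, CODE
    app: str      # application the step touches (hypothetical label)


def count_phases(steps):
    """Count maximal runs of consecutive steps that share a channel."""
    phases = 0
    previous = None
    for step in steps:
        if step.channel != previous:
            phases += 1
            previous = step.channel
    return phases


def admits(steps, min_phases=3, min_apps=2):
    """Structural proxy for the three admission criteria (thresholds assumed)."""
    channels = {s.channel for s in steps}
    # Criterion 1 (proxy): GUI appears together with CLI or code operations.
    hybrid = GUI in channels and bool(channels & {CLI, CODE})
    # Criterion 2 (proxy): several interleaved phases, not one isolated step.
    long_horizon = count_phases(steps) >= min_phases
    # Criterion 3 (proxy): state moves across more than one application.
    cross_app = len({s.app for s in steps}) >= min_apps
    return hybrid and long_horizon and cross_app


hybrid_task = [
    Step(GUI, "spreadsheet"),
    Step(CLI, "terminal"),
    Step(GUI, "spreadsheet"),
    Step(CODE, "editor"),
]
cli_only_task = [Step(CLI, "terminal"), Step(CODE, "editor")]

print(admits(hybrid_task))    # True
print(admits(cli_only_task))  # False
```

A check like this could only serve as a pre-filter; the blind review and pilot validation described above remain the actual gate.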
Usage and Processing
- The authors use the dataset exclusively for evaluation within deployed CLI-agent runtimes on a real Ubuntu desktop augmented with a minimal desktop-control plugin.
- Evaluation employs a trajectory-aware agentic judge that inspects deliverables, files, screenshots, logs, and action traces to compute scores based on bottom-up rubrics.
- Processing includes an inference-time anti-fabrication policy that explicitly prohibits generating fake GUI images via drawing libraries and permits agents to skip uncapturable screenshots with an honest fallback mechanism.
- The benchmark records detailed trajectory statistics, including a median of 76 tool calls and 16 GUI-to-CLI channel switches per task, with the longest rollout reaching 471 tool calls.
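Trajectory statistics such as tool-call counts and channel switches can be derived from a per-call channel log. The sketch below assumes each trajectory is simply a list of channel labels, one per tool call, and counts any change of label between consecutive calls as a switch; the log format and the exact definition of a switch are assumptions, since the summary does not specify them.

```python
from statistics import median


def channel_switches(channels):
    """Count positions where consecutive tool calls use different channels."""
    return sum(1 for a, b in zip(channels, channels[1:]) if a != b)


def summarize(trajectories):
    """Aggregate per-task statistics over a list of channel-label sequences."""
    tool_calls = [len(t) for t in trajectories]
    switches = [channel_switches(t) for t in trajectories]
    return {
        "median_tool_calls": median(tool_calls),
        "median_switches": median(switches),
        "max_tool_calls": max(tool_calls),
    }


# Toy trajectories for illustration only; not benchmark data.
toy = [
    ["gui", "cli", "cli", "gui"],                # 4 calls, 2 switches
    ["cli", "cli", "gui", "cli", "gui", "gui"],  # 6 calls, 3 switches
    ["gui", "gui"],                              # 2 calls, 0 switches
]

stats = summarize(toy)
print(stats)
```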
Metadata Construction
- Metadata is constructed through task bundles that attach provenance indices with URLs, commit hashes, and post identifiers to each task.
- Bundles include expert reference trajectories annotated with required single-channel atomic operations to audit channel usage.
- Verification anchors are embedded within the metadata to support the judge in validating deliverables and detecting shortcut behaviors such as fabricated visual evidence or hard-coded metrics.
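One way to picture a task bundle is as a small record type that carries provenance, the annotated reference trajectory, and verification anchors. The sketch below is hypothetical: the class names, field names, and the `missing_channels` helper are invented for illustration and do not reflect the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Provenance:
    url: str
    commit_hash: Optional[str] = None  # present for repository sources
    post_id: Optional[str] = None      # present for forum or community posts


@dataclass
class ReferenceStep:
    description: str
    channel: str          # "gui", "cli", or "code"
    atomic: bool = False  # True for a required single-channel atomic operation


@dataclass
class TaskBundle:
    task_id: str
    domain: str
    provenance: List[Provenance] = field(default_factory=list)
    reference_trajectory: List[ReferenceStep] = field(default_factory=list)
    verification_anchors: List[str] = field(default_factory=list)

    def required_channels(self) -> Set[str]:
        """Channels that carry at least one required atomic operation."""
        return {s.channel for s in self.reference_trajectory if s.atomic}


def missing_channels(bundle: TaskBundle, used_channels: Set[str]) -> Set[str]:
    """Channels the reference marks as required but the agent never used."""
    return bundle.required_channels() - used_channels


# Hypothetical bundle for illustration only.
bundle = TaskBundle(
    task_id="demo-001",
    domain="data analysis and visualization",
    provenance=[Provenance(url="https://example.org/issue/1", post_id="1")],
    reference_trajectory=[
        ReferenceStep("inspect rendered chart", "gui", atomic=True),
        ReferenceStep("rerun plotting script", "cli", atomic=True),
        ReferenceStep("tidy the script", "code"),
    ],
    verification_anchors=["output/chart.png exists"],
)

print(missing_channels(bundle, {"cli", "code"}))  # {'gui'}
```

An audit like `missing_channels` illustrates why annotating atomic operations is useful: an agent that never touched a required channel could be flagged before its deliverables are even inspected.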
Experiment
The evaluation compares diverse model APIs and agent runtimes to identify the strongest pairings, and dedicated ablations test both the necessity of hybrid GUI-CLI interfaces and the role of trajectory-aware judging. The results indicate that cooperative multi-channel execution is required for task completion: single-interface setups collapse to near-zero performance, unlike prior benchmarks where hybrid access is merely a convenience. Qualitative failure analysis attributes breakdowns primarily to lapses in long-horizon planning discipline and to reward hacking rather than to visual perception, with distinct error patterns recurring across model families. Overall, the work argues that precise model-runtime alignment and rigorous trajectory auditing are both essential for accurately measuring and advancing frontier agent capabilities.
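The gap between outcome-only and trajectory-aware grading can be illustrated with a toy aggregation. In the sketch below, the `Verdict` fields and the rule that a task passes only if the outcome is correct, the process is sound, and no shortcut is flagged are simplifying assumptions; the authors' judge scores bottom-up rubrics, and the numbers here are invented and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    outcome_ok: bool        # final deliverable satisfies the task
    process_ok: bool        # trajectory shows the required steps
    shortcut_flagged: bool  # e.g. fabricated visuals or hard-coded metrics


def pass_rate(verdicts, trajectory_aware):
    """Fraction of tasks that pass under the chosen grading mode."""
    if not verdicts:
        return 0.0
    passed = 0
    for v in verdicts:
        ok = v.outcome_ok
        if trajectory_aware:
            # Also require a sound process and no detected shortcut.
            ok = ok and v.process_ok and not v.shortcut_flagged
        passed += int(ok)
    return passed / len(verdicts)


# Invented verdicts for illustration only.
toy_verdicts = [
    Verdict(outcome_ok=True, process_ok=True, shortcut_flagged=False),
    Verdict(outcome_ok=True, process_ok=False, shortcut_flagged=False),
    Verdict(outcome_ok=True, process_ok=True, shortcut_flagged=True),
    Verdict(outcome_ok=False, process_ok=False, shortcut_flagged=False),
]

outcome_only = pass_rate(toy_verdicts, trajectory_aware=False)
audited = pass_rate(toy_verdicts, trajectory_aware=True)
print(f"outcome-only: {outcome_only:.2f}, trajectory-aware: {audited:.2f}")
```

In this toy set, two of the three outcome-correct runs are rejected once the process is audited, which is the kind of inflation the trajectory-aware judge is meant to expose.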