CUA-Suite: Massive Human-Annotated Video Demonstrations for Computer-Use Agents
Xiangru Jian Shravan Nayak Kevin Qinghong Lin Aarash Feizi Kaixin Li Patrice Bechard Spandana Gella Sai Rajeswar
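Abstract
Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, ScaleCUA, the largest open dataset released to date, contains only 2 million screenshots, which corresponds to less than 20 hours of video. To address this bottleneck, we introduce CUA-Suite, an ecosystem of large-scale video demonstrations and dense annotations for expert-level desktop CUAs. At the core of CUA-Suite is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications, with continuous 30 fps screen recordings, kinematic cursor trajectories, and multi-layered reasoning annotations. Together these amount to approximately 55 hours and 6 million frames of expert video. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction and form a superset of information that can be losslessly converted into the formats required by existing agent frameworks. CUA-Suite further provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating the grounding and planning capabilities of CUAs, and GroundCUA, a large-scale grounding dataset with 56,000 annotated screenshots and more than 3.6 million UI element annotations. Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (a task failure rate of about 60%). Beyond evaluation, CUA-Suite's rich multimodal corpus supports emerging research directions such as generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models. All data and models are publicly released.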
One-sentence Summary
Researchers from ServiceNow, Mila, and other institutions introduce CUA-SUITE, a large-scale ecosystem featuring VIDEOCUA, which offers continuous 30 fps screen recordings and dense reasoning annotations to overcome the scarcity of high-quality human demonstrations for training general-purpose computer-use agents.
Key Contributions
- The paper introduces VIDEOCUA, a large-scale corpus of approximately 55 hours of continuous 30 fps expert video recordings covering 10,000 tasks across 87 desktop applications, enriched with kinematic cursor traces and multi-layered reasoning annotations to preserve full temporal dynamics.
- This work unifies continuous video demonstrations with pixel-precise UI grounding data from GROUNDCUA and a rigorous evaluation benchmark called UI-VISION into the CUA-SUITE ecosystem to provide dense, causal supervision for training and testing computer-use agents.
- All benchmarks, training data, and models associated with the CUA-SUITE framework are released as open-source resources to support emerging research directions such as generalist screen parsing, continuous spatial control, and visual world models.
Introduction
Computer-use agents aim to transform digital tools into active collaborators capable of navigating complex interfaces and executing workflows, yet current models remain brittle when handling professional desktop applications. Prior efforts to address this rely either on automatically generated data, which introduces noise, or on sparse screenshot-based datasets, which lack the temporal continuity needed for learning smooth cursor movements and long-horizon planning. To overcome these limitations, the authors introduce CUA-SUITE, a comprehensive ecosystem that unifies 55 hours of high-fidelity expert video demonstrations with pixel-precise UI annotations and a rigorous evaluation benchmark. This resource provides dense, causal supervision across 87 applications, enabling the training of foundation action models that can master continuous spatial control and complex reasoning in real-world software environments.
Dataset
Dataset Composition and Sources
- The authors introduce CUA-SUITE, a unified ecosystem built from high-fidelity human demonstrations across 87 diverse open-source desktop applications.
- The suite comprises three complementary resources: VIDEOCUA for continuous video training, GROUNDCUA for fine-grained UI grounding, and UI-VISION for benchmarking visual perception and planning.
- Data collection involved approximately 70 professional annotators who designed and executed over 10,000 expert tasks ranging from simple actions to complex workflows.
Key Details for Each Subset
- VIDEOCUA: Contains approximately 55 hours of continuous 30 fps video (6 million frames) covering 10,000 tasks. It includes synchronized kinematic cursor traces and multi-layered reasoning annotations averaging 497 words per step.
- GROUNDCUA: A training corpus derived from the video data, featuring 56,000 annotated screenshots with over 3.6 million UI element annotations. It includes bounding boxes, textual labels, and functional categories for 50% of elements.
- UI-VISION: A benchmark dataset consisting of 450 high-quality task demonstrations designed to evaluate element grounding, layout grounding, and action prediction capabilities.
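The headline figures above can be cross-checked with simple arithmetic. The sketch below assumes the 30 fps rate stated for VIDEOCUA, and it assumes the same rate underlies the ScaleCUA comparison in the abstract (2 million screenshots, under 20 hours); the function names are illustrative only.

```python
FPS = 30  # frame rate stated for VIDEOCUA


def frames_from_hours(hours: float, fps: int = FPS) -> int:
    """Number of video frames contained in `hours` of footage at `fps`."""
    return round(hours * 3600 * fps)


def hours_from_frames(frames: int, fps: int = FPS) -> float:
    """Duration in hours of `frames` frames played back at `fps`."""
    return frames / (fps * 3600)


# 55 hours at 30 fps: 5,940,000 frames, i.e. roughly 6 million.
videocua_frames = frames_from_hours(55)

# 2 million screenshots treated as 30 fps video (an assumption): about 18.5 hours.
scalecua_hours = hours_from_frames(2_000_000)

# GROUNDCUA density: about 64 annotated UI elements per screenshot.
elements_per_screenshot = 3_600_000 / 56_000

print(videocua_frames, round(scalecua_hours, 1), round(elements_per_screenshot, 1))
```

The rounded totals in the text (6 million frames, under 20 hours) are consistent with these values.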
Data Usage and Processing
- The authors utilize VIDEOCUA as a high-quality expansion for training generalist Computer-Use Agents, ensuring compatibility with existing frameworks like OpenCUA and ScaleCUA.
- Multi-layered reasoning annotations are synthesized using Claude-Sonnet-4.5 to generate observation, thought chain, action description, and reflection layers for each trajectory step.
- GROUNDCUA supports a two-stage training recipe involving supervised fine-tuning (SFT) followed by reinforcement learning (RL) to train efficient vision-language models like GROUND-NEXT.
- UI-VISION serves as the primary evaluation benchmark for diagnosing bottlenecks in visual grounding and planning, revealing that spatial reasoning remains a significant challenge for current models.
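The four reasoning layers named above can be pictured as one record per trajectory step. The following is a minimal sketch only: the class name, field names, and example strings are hypothetical placeholders, not the released schema or real annotations.

```python
from dataclasses import dataclass, asdict


@dataclass
class StepAnnotation:
    """Hypothetical container for the four reasoning layers of one step."""
    observation: str         # what is visible on screen before acting
    thought: str             # reasoning chain that leads to the action
    action_description: str  # natural-language description of the action
    reflection: str          # assessment of the outcome after acting

    def word_count(self) -> int:
        """Total whitespace-separated words across all four layers."""
        return sum(len(text.split()) for text in asdict(self).values())


# Placeholder content for illustration; real annotations are far longer.
example = StepAnnotation(
    observation="The Layers panel is open on the right.",
    thought="A new layer is needed before drawing.",
    action_description="Click the New Layer button.",
    reflection="A new layer now appears in the panel.",
)
print(example.word_count())
```

A per-step word count of this kind is the quantity behind the reported average of 497 words per step, although the exact counting rule used by the authors is not stated here.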
Cropping, Metadata, and Annotation Strategy
- Keyframes are extracted from continuous video streams specifically at moments immediately preceding state-changing user actions to capture the decision-making context.
- Annotators manually label every visible UI element in these keyframes with bounding boxes and provide textual labels or concise summaries for long text segments.
- OCR via PaddleOCR is applied to extract raw text for lengthy content like source code, supplementing manual summaries.
- The dataset preserves full temporal dynamics and intermediate cursor movements, allowing for lossless transformation into various agent training formats such as screenshot-action pairs or continuous kinematic traces.
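To illustrate why a continuous recording is a superset of the sparse formats, the sketch below derives screenshot-action pairs from a logged action list and a cursor trace. Everything here is an assumption made for illustration: the record layout, the millisecond timestamps, and the rule "take the last frame strictly before the action" as one reading of "immediately preceding".

```python
from dataclasses import dataclass
from typing import List, Tuple

FPS = 30  # frame rate stated for the recordings

CursorPoint = Tuple[int, int, int]  # (t_ms, x, y), hypothetical layout


@dataclass
class Action:
    t_ms: int  # time of the state-changing action, in milliseconds
    kind: str  # e.g. "click"
    x: int
    y: int


@dataclass
class Sample:
    frame: int                      # index of the keyframe in the video
    action: Action                  # the action taken after that frame
    cursor_path: List[CursorPoint]  # cursor motion leading up to the action


def frame_before(t_ms: int, fps: int = FPS) -> int:
    """Index of the last frame whose timestamp is strictly before t_ms."""
    return max(0, (t_ms * fps - 1) // 1000)


def to_samples(actions: List[Action], cursor: List[CursorPoint]) -> List[Sample]:
    """Pair each action with its preceding frame and cursor segment."""
    samples, prev_t = [], 0
    for a in sorted(actions, key=lambda a: a.t_ms):
        path = [p for p in cursor if prev_t <= p[0] < a.t_ms]
        samples.append(Sample(frame_before(a.t_ms), a, path))
        prev_t = a.t_ms
    return samples


actions = [Action(1000, "click", 100, 80), Action(2500, "click", 320, 210)]
cursor = [(0, 0, 0), (500, 50, 40), (1000, 100, 80), (2000, 300, 200)]
samples = to_samples(actions, cursor)

# Dropping the cursor path yields the sparse screenshot-action format.
sparse = [(s.frame, s.action.kind, s.action.x, s.action.y) for s in samples]
print(sparse)
```

The conversion only discards information (the cursor path and the frames in between), which is why the reverse direction, from sparse pairs back to continuous video, is not possible.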
Method
The authors present CUA-Suite, a unified framework designed to facilitate the development of generalist computer-use agents through massive-scale software coverage and dense unified annotations. The architecture integrates a rigorous data creation pipeline with specialized modules for visual understanding and trajectory modeling.
The data creation process is structured into five sequential stages. It begins with Human Annotator UI Training to establish baseline proficiency, followed by UI Task Execution on the target software. During execution, Screen Recording and Action Logs are captured. Annotators then process these recordings and logs by Annotating Keyframes with bounding boxes, OCR data, and interaction details. The pipeline concludes with Quality Assurance, in which expert human reviewers verify the annotations.
The suite comprises three core components. UI-Vision handles Action Prediction, Element Grounding, and Layout Grounding, allowing agents to interpret interface elements and predict spatial coordinates. GroundCUA focuses on Computer Use Instructions, providing examples for tasks like highlighting specific UI regions or selecting color swatches. VideoCUA utilizes 55 hours of human demonstrations to model Trajectories, decomposing tasks into steps containing observations, thoughts, reflections, and actions.
To ensure robust evaluation and prevent information leakage regarding cursor positions, the authors employ specific preprocessing strategies. Keyframe Extraction is performed at the temporal midpoint between consecutive actions. For an action $a_t$ with timestamp $\tau_t$, the keyframe is captured at $(\tau_{t-1} + \tau_t)/2$. This ensures the cursor has not yet reached the target location, providing a fairer assessment of spatial grounding. Furthermore, the authors implement moveTo Handling by excluding moveTo steps from the evaluation and action history, as these are preparatory movements. For click actions that directly follow a moveTo, the keyframe from the moveTo step is used instead of the click step's keyframe to avoid revealing the target position.
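The two preprocessing rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `Step` and `EvalItem` types are hypothetical, the first action's predecessor time is taken to be the recording start (`start_time`), and only steps whose kind is exactly "click" receive the substituted keyframe.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    kind: str  # e.g. "click", "moveTo", "type"
    t: float   # action timestamp tau_t, in seconds


@dataclass
class EvalItem:
    step_index: int       # index into the original step list
    kind: str
    keyframe_time: float  # video time at which the keyframe is sampled


def midpoint_keyframes(steps: List[Step], start_time: float = 0.0) -> List[float]:
    """Keyframe time (tau_{t-1} + tau_t) / 2 for every step."""
    times, prev = [], start_time
    for s in steps:
        times.append((prev + s.t) / 2)
        prev = s.t
    return times


def build_eval_items(steps: List[Step], start_time: float = 0.0) -> List[EvalItem]:
    """Drop moveTo steps; a click right after a moveTo reuses its keyframe."""
    key_times = midpoint_keyframes(steps, start_time)
    items = []
    for i, s in enumerate(steps):
        if s.kind == "moveTo":
            continue  # preparatory movement, excluded from evaluation
        t = key_times[i]
        if s.kind == "click" and i > 0 and steps[i - 1].kind == "moveTo":
            t = key_times[i - 1]  # cursor has not yet moved toward the target
        items.append(EvalItem(i, s.kind, t))
    return items


steps = [Step("click", 2.0), Step("moveTo", 4.0), Step("click", 5.0), Step("type", 9.0)]
items = build_eval_items(steps)
for item in items:
    print(item)
```

In this example the second click would otherwise be evaluated on a frame at 4.5 s, after the moveTo at 4.0 s has already placed the cursor on the target; reusing the moveTo keyframe at 3.0 s avoids that leak.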
Experiment
- Action prediction experiments on 256 tasks across 87 desktop applications show that current foundation models struggle with complex, multi-panel interfaces: accuracy remains modest even as models scale from 7B to 32B parameters.
- Qualitative analysis reveals that models frequently fail to disambiguate visually similar elements in specialized creative tools and canvas-based applications, often resulting in cross-panel errors or incorrect UI region selection.
- Human evaluation confirms that while models generally identify the correct action intent, they lack precision in spatial grounding, leading to a significant gap between action correctness and coordinate accuracy.
- Application-level analysis demonstrates that performance is highly dependent on interface design, with web-like layouts yielding higher success rates compared to dense, non-standard toolbars found in professional software.
- Detailed trajectory case studies in Krita and GIMP illustrate that agents can successfully execute multi-step workflows involving tool selection, shape creation, and effect application, though they remain prone to coordinate misalignment and redundant actions during complex interactions.
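The gap between action correctness and coordinate accuracy noted above can be made concrete by scoring the two separately. The sketch below uses a point-in-bounding-box hit test, a common convention in grounding evaluation; it is an assumption for illustration and not necessarily the protocol used in the paper, and all type names and example values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Prediction:
    kind: str  # predicted action type, e.g. "click"
    x: int
    y: int


@dataclass
class Target:
    kind: str  # ground-truth action type
    box: Box   # bounding box of the intended UI element


def in_box(x: int, y: int, box: Box) -> bool:
    """True if the point lies inside the box, edges included."""
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def score(preds: List[Prediction], targets: List[Target]) -> Tuple[float, float]:
    """Return (action-type accuracy, type-and-location accuracy)."""
    if len(preds) != len(targets) or not targets:
        raise ValueError("preds and targets must be non-empty and equal in length")
    n = len(targets)
    kind_ok = [p.kind == t.kind for p, t in zip(preds, targets)]
    both_ok = [k and in_box(p.x, p.y, t.box) for k, p, t in zip(kind_ok, preds, targets)]
    return sum(kind_ok) / n, sum(both_ok) / n


# Illustrative values only, not results from the paper.
targets = [Target("click", (10, 10, 50, 30))] * 4
preds = [
    Prediction("click", 20, 20),  # right type, inside the box
    Prediction("click", 45, 25),  # right type, inside the box
    Prediction("click", 80, 20),  # right type, wrong location
    Prediction("type", 20, 20),   # wrong type
]
action_acc, grounded_acc = score(preds, targets)
print(action_acc, grounded_acc)
```

Reporting both numbers side by side separates failures of intent from failures of spatial precision.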