12 days ago

VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos


Abstract

Training computer-use agents requires massive amounts of GUI interaction data, but manually annotating action trajectories at scale is prohibitively expensive. We present VideoAgentTrek, a scalable pipeline that automatically mines training data from publicly available screen-recorded videos at web scale, eliminating the need for manual annotation. Our approach addresses a key challenge: raw videos contain implicit demonstrations but lack explicit action labels. To solve this, we develop Video2Action, an inverse dynamics module (IDM) with two components: (1) a video grounding model that detects and localizes GUI actions with precise temporal boundaries and context, and (2) an action-content recognizer that extracts structured parameters like click coordinates and typed text with high fidelity. Applied to 39,000 YouTube tutorial videos, our pipeline generates 1.52 million interaction steps automatically. We leverage this data through continued pretraining followed by supervised fine-tuning. On OSWorld-Verified, our approach improves task success rates from 9.3% (SFT-only baseline) to 15.8%, a 70% relative improvement. On AgentNetBench, step accuracy increases from 64.1% to 69.3%. Our results demonstrate that passive internet videos can be transformed into high-quality supervision for computer-use agents, providing a scalable alternative to expensive manual annotation.
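
The "70% relative improvement" quoted in the abstract can be checked from the two reported OSWorld-Verified success rates. The snippet below is a minimal sketch of that arithmetic only; the function name `relative_improvement` is a hypothetical helper introduced here for illustration and is not part of the paper's code. It applies the same formula to the AgentNetBench step accuracies, a figure the abstract itself does not state in relative terms.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative gain of `improved` over `baseline` as a fraction.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Figures reported in the abstract (percent).
osworld_gain = relative_improvement(9.3, 15.8)        # OSWorld-Verified task success
agentnet_gain = relative_improvement(64.1, 69.3)      # AgentNetBench step accuracy

print(f"OSWorld-Verified: {osworld_gain * 100:.1f}% relative improvement")
print(f"AgentNetBench:    {agentnet_gain * 100:.1f}% relative improvement")
```

The first value comes out just under 70%, which matches the rounded figure in the abstract; the absolute gain on OSWorld-Verified is 6.5 percentage points.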
