
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Xu, Yiheng ; Wang, Zekun ; Wang, Junli ; Lu, Dunjie ; Xie, Tianbao ; Saha, Amrita ; Sahoo, Doyen ; Yu, Tao ; Xiong, Caiming
Abstract

Automating GUI tasks remains challenging due to reliance on textual representations, platform-specific action spaces, and limited reasoning capabilities. We introduce Aguvis, a unified vision-based framework for autonomous GUI agents that directly operates on screen images, standardizes cross-platform interactions and incorporates structured reasoning via inner monologue. To enable this, we construct Aguvis Data Collection, a large-scale dataset with multimodal grounding and reasoning annotations, and develop a two-stage training pipeline that separates GUI grounding from planning and reasoning. Experiments show that Aguvis achieves state-of-the-art performance across offline and real-world online benchmarks, marking the first fully autonomous vision-based GUI agent that operates without closed-source models. We open-source all datasets, models, and training recipes at https://aguvis-project.github.io to advance future research.
