8 months ago

Human-Computer Interaction

Method/Architecture

Xu Yiheng ; Wang Zekun ; Wang Junli ; Lu Dunjie ; Xie Tianbao ; Saha Amrita ; Sahoo Doyen ; Yu Tao ; Xiong Caiming

Abstract

Automating GUI tasks remains challenging due to reliance on textual representations, platform-specific action spaces, and limited reasoning capabilities. We introduce Aguvis, a unified vision-based framework for autonomous GUI agents that directly operates on screen images, standardizes cross-platform interactions, and incorporates structured reasoning via inner monologue. To enable this, we construct Aguvis Data Collection, a large-scale dataset with multimodal grounding and reasoning annotations, and develop a two-stage training pipeline that separates GUI grounding from planning and reasoning. Experiments show that Aguvis achieves state-of-the-art performance across offline and real-world online benchmarks, marking the first fully autonomous vision-based GUI agent that operates without closed-source models. We open-source all datasets, models, and training recipes at https://aguvis-project.github.io to advance future research.
