2 months ago

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou
Abstract

Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source APIs with text-rich meta-information (e.g., HTML or accessibility tree), they show limitations in perceiving UI visuals as humans do, highlighting the need for GUI visual agents. In this work, we develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection to reduce computational costs by formulating screenshots as a UI connected graph, adaptively identifying their redundant relationships, which serve as the criteria for token selection during self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming that flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or pairing multi-turn query-action sequences per screenshot to enhance training efficiency; (iii) Small-scale High-quality GUI Instruction-following Datasets built through careful data curation and a resampling strategy that addresses significant data type imbalances. With the above components, ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding. Its UI-guided token selection further reduces 33% of redundant visual tokens during training and delivers a 1.4x speedup. Navigation experiments across web (Mind2Web), mobile (AITW), and online (MiniWob) environments further underscore the effectiveness and potential of our model in advancing GUI visual agents. The models are available at https://github.com/showlab/ShowUI.
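
The abstract does not spell out how the UI connected graph is built, so the following is only a minimal sketch of the general idea behind innovation (i), not the authors' implementation. It assumes that each screenshot patch is summarized by its mean RGB value, that adjacent patches with near-identical color are joined into one component (via union-find), and that a fraction of tokens is then sampled at random from each component. The names `build_components`, `select_tokens`, `tol`, and `keep_ratio` are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Patch = Tuple[int, int, int]  # assumed: mean RGB of one screenshot patch


def build_components(grid: Sequence[Sequence[Patch]], tol: int = 0) -> List[int]:
    """Label each patch with the root of its connected component.

    Two 4-adjacent patches are joined when every RGB channel differs
    by at most `tol` (an assumed similarity rule).
    """
    rows, cols = len(grid), len(grid[0])
    parent = list(range(rows * cols))

    def find(i: int) -> int:
        # Path-halving union-find lookup.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a: int, b: int) -> None:
        parent[find(a)] = find(b)

    def similar(p: Patch, q: Patch) -> bool:
        return all(abs(x - y) <= tol for x, y in zip(p, q))

    for r in range(rows):
        for c in range(cols):
            idx = r * cols + c
            # Compare with the right-hand and lower neighbours only.
            if c + 1 < cols and similar(grid[r][c], grid[r][c + 1]):
                union(idx, idx + 1)
            if r + 1 < rows and similar(grid[r][c], grid[r + 1][c]):
                union(idx, idx + cols)
    return [find(i) for i in range(rows * cols)]


def select_tokens(labels: List[int], keep_ratio: float = 0.5, seed: int = 0) -> List[int]:
    """Keep a random subset of patch indices from each component.

    Every component keeps at least one token, so small UI elements
    (icons, text) are never dropped entirely.
    """
    rng = random.Random(seed)
    groups: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        groups.setdefault(label, []).append(idx)
    kept: List[int] = []
    for members in groups.values():
        k = max(1, int(len(members) * keep_ratio))
        kept.extend(rng.sample(members, k))
    return sorted(kept)


if __name__ == "__main__":
    W, B = (255, 255, 255), (0, 0, 255)
    # A 2x3 patch grid: white background with one blue "button" patch.
    labels = build_components([[W, W, W], [W, B, W]])
    print(select_tokens(labels, keep_ratio=0.4))
```

In this toy grid the five white background patches collapse into one component and are subsampled, while the single blue patch forms its own component and is always retained. That is the intuition behind treating large uniform regions as redundant.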
