GTA1: GUI Test-time Scaling Agent

Graphical user interface (GUI) agents autonomously operate across platforms (e.g., Linux) to complete tasks by interacting with visual elements. Specifically, a user instruction is decomposed into a sequence of action proposals, each corresponding to an interaction with the GUI. After each action, the agent observes the updated GUI environment to plan the next step. However, two main challenges arise: (i) resolving ambiguity in task planning (i.e., the action proposal sequence), where selecting an appropriate plan is non-trivial because many valid plans may exist; and (ii) accurately grounding actions in complex, high-resolution interfaces, i.e., precisely interacting with visual targets. This paper addresses these two challenges with our GUI Test-time Scaling Agent, GTA1. First, to select the most appropriate action proposal, we introduce a test-time scaling method: at each step, we sample multiple candidate action proposals and use a judge model to evaluate them and select the most suitable one. This trades additional computation, via concurrent sampling, for better decision quality, shortening task execution and improving overall performance. Second, we propose a model that grounds the selected action proposal to its corresponding visual elements with improved accuracy. Our key insight is that reinforcement learning (RL) facilitates visual grounding through inherent objective alignment, rewarding successful clicks on interface elements.

Experimentally, our method establishes state-of-the-art performance across diverse benchmarks. For example, GTA1-7B achieves 50.1%, 92.4%, and 67.7% accuracy on ScreenSpot-Pro, ScreenSpot-V2, and OSWorld-G, respectively. When paired with a planner applying our test-time scaling strategy, it exhibits state-of-the-art agentic performance (e.g., a 45.2% task success rate on OSWorld). We open-source our code and models here.
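The per-step selection procedure described above can be sketched as a best-of-N loop: sample several candidate action proposals concurrently, score each with a judge, and keep the highest-scoring one. The sketch below is illustrative only; `planner` and `judge` are hypothetical stand-ins for the paper's planner and judge models, not their actual interfaces.

```python
from itertools import cycle

def sample_proposals(planner, observation, n):
    """Sample n candidate action proposals for the current GUI observation.

    In practice these calls would be issued concurrently to the planner
    model; here they are sequential for simplicity.
    """
    return [planner(observation) for _ in range(n)]

def select_action(judge, observation, proposals):
    """Score each proposal with the judge and return the best one."""
    scores = [judge(observation, p) for p in proposals]
    best_idx = max(range(len(proposals)), key=scores.__getitem__)
    return proposals[best_idx]

# Toy demo with deterministic stand-ins (purely hypothetical):
# the planner cycles through three candidate actions, and the judge
# assigns a fixed suitability score to each.
candidates = cycle(["scroll down", "click Submit", "type text"])
planner = lambda obs: next(candidates)
judge = lambda obs, p: {"scroll down": 0.2, "click Submit": 0.9, "type text": 0.4}[p]

proposals = sample_proposals(planner, "login page", n=3)
best = select_action(judge, "login page", proposals)
print(best)  # -> click Submit
```

In the agent loop, this selection runs once per step: the chosen action is executed, the GUI is re-observed, and fresh proposals are sampled for the next step.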