
VeriGUI: Verifiable Long-Chain GUI Dataset

Shunyu Liu, Minghao Liu, Huichi Zhou, Zhenyu Cui, Yang Zhou, Yuhao Zhou, Wendong Fan, Ge Zhang, Jiajun Shi, Weihao Xuan, Jiaxing Huang, Shuang Luo, Fang Wu, Heli Qi, Qingcheng Zeng, Ziqi Ren, Jialiang Gao, Jindi Lv, Junjie Wang, Aosong Feng, Heng Zhou, Wangchunshu Zhou, Zhenfei Yin, Wenlong Zhang, Guohao Li, Wenhao Yu, Irene Li, Lei Ma, Lei Bai, Qunshu Lin, Mingli Song, Dacheng Tao
Abstract

Recent studies have delved into constructing autonomous agents capable of performing complex Graphical User Interface (GUI)-based computer tasks, with the potential to revolutionize human-computer interaction. Despite encouraging results, existing efforts mainly focus on short-term interactions and rely on outcome-only verification, thereby limiting their scalability in real-world GUI applications that demand long-horizon task decomposition and execution. In this work, we introduce VeriGUI, a novel verifiable long-chain GUI dataset designed to facilitate the development and evaluation of generalist GUI agents operating in realistic computer environments. Our dataset emphasizes two critical dimensions: (1) long-chain complexity, with tasks decomposed into a sequence of interdependent subtasks spanning hundreds of steps, explicitly designed to allow any subtask to serve as a valid starting point; and (2) subtask-level verifiability, which enables diverse exploration strategies within each subtask, while ensuring that each subtask-level goal remains verifiable and consistent. The dataset consists of GUI task trajectories across both desktop and web, annotated by human experts. Extensive experiments on VeriGUI using various agents with different foundation models reveal significant performance gaps in handling long-horizon tasks, highlighting the need for more robust planning and decision-making capabilities in GUI agents.