
OmniParser for Pure Vision Based GUI Agent

Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah
Abstract

The recent success of large vision-language models shows great potential in driving agent systems that operate on user interfaces. However, we argue that the power of multimodal models like GPT-4V as a general agent across multiple operating systems and applications is largely underestimated due to the lack of a robust screen-parsing technique capable of: 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associating the intended action with the corresponding region on the screen. To fill these gaps, we introduce OmniParser, a comprehensive method for parsing user interface screenshots into structured elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface. We first curated an interactable icon detection dataset from popular webpages, along with an icon description dataset. These datasets were used to fine-tune two specialized models: a detection model that parses interactable regions on the screen and a caption model that extracts the functional semantics of the detected elements. OmniParser significantly improves GPT-4V's performance on the ScreenSpot benchmark, and on the Mind2Web and AITW benchmarks, OmniParser with screenshot-only input outperforms GPT-4V baselines that require additional information beyond the screenshot.
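To make the pipeline concrete, below is a minimal Python sketch of the two-stage parsing flow the abstract describes: detect interactable regions, caption each one, and serialize the result into structured text a multimodal model can ground its actions against. The function names (`detect_icons`, `caption_region`, `parse_screenshot`, `to_prompt`) and the element schema are illustrative assumptions, not the released OmniParser API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical sketch of an OmniParser-style parsing pipeline.
# detect_icons and caption_region stand in for the fine-tuned
# detection and captioning models mentioned in the abstract.

@dataclass
class UIElement:
    box: Tuple[int, int, int, int]   # (x1, y1, x2, y2) pixel coordinates
    kind: str                        # e.g. "icon" or "text"
    description: str                 # functional semantics of the element


def parse_screenshot(
    image_bytes: bytes,
    detect_icons: Callable[[bytes], List[Tuple[int, int, int, int]]],
    caption_region: Callable[[bytes, Tuple[int, int, int, int]], str],
) -> List[UIElement]:
    """Detect interactable regions, then caption each detected region."""
    elements: List[UIElement] = []
    for box in detect_icons(image_bytes):
        elements.append(
            UIElement(box=box, kind="icon",
                      description=caption_region(image_bytes, box))
        )
    return elements


def to_prompt(elements: List[UIElement]) -> str:
    """Serialize parsed elements into text so the downstream model can
    refer to an element by index when predicting an action."""
    lines = [f"[{i}] {e.kind} at {e.box}: {e.description}"
             for i, e in enumerate(elements)]
    return "Interactable elements:\n" + "\n".join(lines)
```

In this sketch, the structured element list (index, bounding box, description) is what lets the action predicted by GPT-4V be mapped back to a concrete screen region, which is the grounding gap the paper targets.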
