OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance lag compared to their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas - a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, MacOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models. Our evaluation also uncovers valuable insights into continuously improving and scaling the agentic capabilities of open-source VLMs.