ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.
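To make the data-generation idea concrete, the sketch below shows how screen annotations (UI element types, text, and locations) could be serialized into a textual screen description and wrapped in a prompt for an LLM to produce QA pairs. This is a minimal illustration, not the authors' pipeline; the element schema, prompt wording, and all names here are hypothetical.

```python
# Hypothetical sketch of the screen-annotation-to-QA data generation idea.
# Detected UI elements are serialized into text; an LLM would then turn the
# resulting prompt into synthetic question-answer pairs for training.
from dataclasses import dataclass

@dataclass
class UIElement:
    kind: str            # e.g. "BUTTON", "TEXT", "IMAGE" (assumed labels)
    text: str            # visible text or caption, possibly empty
    bbox: tuple          # normalized (x0, y0, x1, y1) coordinates

def screen_to_text(elements: list[UIElement]) -> str:
    """Serialize detected UI elements into a screen annotation string."""
    lines = [
        f"{e.kind} '{e.text}' at "
        f"({e.bbox[0]:.2f}, {e.bbox[1]:.2f}, {e.bbox[2]:.2f}, {e.bbox[3]:.2f})"
        for e in elements
    ]
    return "\n".join(lines)

def build_qa_prompt(annotation: str) -> str:
    """Wrap the annotation in an instruction asking an LLM for QA pairs."""
    return (
        "Here is a description of a mobile screen:\n"
        f"{annotation}\n\n"
        "Generate question-answer pairs a user might ask about this screen."
    )

if __name__ == "__main__":
    screen = [
        UIElement("TEXT", "Flight to Tokyo", (0.05, 0.10, 0.95, 0.16)),
        UIElement("BUTTON", "Book now", (0.30, 0.80, 0.70, 0.88)),
    ]
    # The printed prompt would be sent to an LLM; its answers become
    # synthetic QA training data at scale.
    print(build_qa_prompt(screen_to_text(screen)))
```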