ShowUI: A Vision-Language-Action Model Focused on GUI Automation


Tutorial Introduction
ShowUI is a vision-language-action model jointly developed in 2024 by Show Lab at the National University of Singapore and Microsoft. It is designed as a graphical user interface (GUI) intelligent assistant and aims to improve the efficiency of human work. The accompanying paper is "ShowUI: One Vision-Language-Action Model for GUI Visual Agent". The model understands the content of the screen and performs interactive actions such as clicking, typing, and scrolling. It supports both web and mobile application scenarios and can automatically complete complex user-interface tasks. Given a screenshot and a user instruction, ShowUI predicts the interactive action to take on the interface.
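For readers who want to try the model outside this demo, the sketch below shows one way inference might look. It is a minimal sketch, assuming the showlab/ShowUI-2B checkpoint published on Hugging Face (built on Qwen2-VL) and the standard transformers chat-template API; the prompt text, screenshot path, and the normalized-coordinate output format are assumptions based on the model card, not guaranteed behavior.

```python
# Minimal sketch: querying ShowUI for a click location on a screenshot.
# Assumes the showlab/ShowUI-2B checkpoint (built on Qwen2-VL) and that the
# model answers grounding queries with coordinates scaled from 0 to 1.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "showlab/ShowUI-2B"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

screenshot = Image.open("screenshot.png")  # hypothetical input image
instruction = "Click the search button"    # hypothetical task instruction

# One image placeholder plus the text query, in Qwen2-VL chat format.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": instruction},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[prompt], images=[screenshot], return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens and decode only the newly generated answer,
# e.g. a string such as "[0.73, 0.21]" (relative x, y on the screenshot).
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```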
This tutorial is a demo of ShowUI; the compute resource is an RTX 4090. Simply provide an image and a task instruction: whether the image is a screenshot from a phone or computer or another type of image, ShowUI can point out where to operate.
Demo Showcase

How to Run (after the container starts, initialization takes about 15 seconds; then perform the following steps)
1. After cloning and starting the container, hover the mouse over the API address and click the arrow that appears. If the page shows "Bad Gateway", the model is still initializing; wait about 30 seconds and try again (see the polling sketch after this list).

An example of a successfully opened interface is shown below:

2. After entering the demo page, upload an image, type the instruction into the input box, and click "Submit". A red dot on the generated image marks the operation area, and the coordinates of the red dot are displayed below it (a sketch of how such coordinates map back to pixels follows this list).
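The "Bad Gateway" check in step 1 can also be automated. The sketch below polls the container's API address until the gateway stops returning an error; the URL, timeout, and interval are placeholders to substitute with your own container's values.

```python
# Minimal sketch: wait until the demo container has finished initializing.
# The URL below is a placeholder; substitute the API address of your container.
import time
import requests

def wait_until_ready(url: str, timeout: float = 120.0, interval: float = 5.0) -> bool:
    """Poll `url` until it stops answering 502 Bad Gateway (or times out)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code != 502:  # 502 = gateway up, model still loading
                return True
        except requests.RequestException:
            pass  # container not reachable yet; keep waiting
        time.sleep(interval)
    return False

if wait_until_ready("https://example-container-api-address/"):
    print("Demo page is ready; open it in the browser.")
```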

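As a companion to step 2, this sketch shows how the displayed coordinates could be mapped back onto the uploaded image, assuming the demo reports a normalized [x, y] pair scaled from 0 to 1; the function name and exact coordinate format are illustrative.

```python
# Minimal sketch: reproduce the demo's red dot from a normalized [x, y] pair.
# Assumes the demo prints coordinates like "[0.73, 0.21]" scaled from 0 to 1.
import ast
from PIL import Image, ImageDraw

def mark_click_point(image_path: str, coords_text: str,
                     out_path: str = "marked.png", radius: int = 8):
    """Draw a red dot at the predicted click position and save the image."""
    x, y = ast.literal_eval(coords_text)          # "[0.73, 0.21]" -> (0.73, 0.21)
    img = Image.open(image_path).convert("RGB")
    px, py = x * img.width, y * img.height        # normalized -> pixel coordinates
    draw = ImageDraw.Draw(img)
    draw.ellipse((px - radius, py - radius, px + radius, py + radius), fill="red")
    img.save(out_path)
    return px, py

print(mark_click_point("screenshot.png", "[0.73, 0.21]"))
```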
Discussion and Exchange
🖌️ If you come across a high-quality project, please leave us a message to recommend it! We have also set up a tutorial exchange group; you are welcome to scan the QR code and add the note [Tutorial Exchange] to join the group, discuss technical issues, and share application results↓
