Natural Language Visual Grounding
Metric: Accuracy (%)

Results

Accuracy of various models on this benchmark. Where the same paper appears more than once, the rows presumably correspond to different model variants or evaluation settings reported in that paper.

Comparison Table
| Model Name | Accuracy (%) |
|---|---|
| UGround (Navigating the Digital World as Humans Do) | 86.34 |
| Aguvis | 83.0 |
| OS-Atlas | 82.47 |
| Aria-UI | 81.1 |
| Aguvis | 81.0 |
| UGround (Navigating the Digital World as Humans Do) | 77.67 |
| ShowUI | 75.1 |
| ShowUI | 75.0 |
| UGround (Navigating the Digital World as Humans Do) | 73.3 |
| OmniParser | 73.0 |
| OS-Atlas | 68.0 |
| SeeClick | 53.4 |
| CogAgent | 47.4 |
| Qwen2-VL | 42.1 |
| GUICourse | 28.6 |
| MiniGPT-v2 | 5.7 |
| Groma | 5.2 |
| Qwen-VL | 5.2 |
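
For reference, GUI grounding benchmarks of this kind typically score a prediction as correct when the model's predicted click point falls inside the ground-truth bounding box of the target element, and report accuracy as the percentage of correct predictions. The snippet below is a minimal sketch of that scoring protocol, not taken from any specific benchmark's evaluation code; the `Example` container and its field names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Example:
    # Ground-truth bounding box of the target UI element, in pixels:
    # (left, top, right, bottom). Field layout is an illustrative assumption.
    bbox: tuple[float, float, float, float]
    # Model-predicted click point (x, y) for the natural-language instruction.
    pred: tuple[float, float]


def is_hit(example: Example) -> bool:
    """A prediction counts as correct if the predicted point
    falls inside the ground-truth element's bounding box."""
    x, y = example.pred
    left, top, right, bottom = example.bbox
    return left <= x <= right and top <= y <= bottom


def accuracy(examples: Iterable[Example]) -> float:
    """Percentage of examples where the predicted point hits the target box."""
    examples = list(examples)
    hits = sum(is_hit(ex) for ex in examples)
    return 100.0 * hits / len(examples)


# Toy usage: one prediction inside the target box, one outside.
print(accuracy([
    Example(bbox=(10, 10, 50, 30), pred=(25, 20)),    # hit
    Example(bbox=(100, 40, 160, 80), pred=(90, 60)),  # miss
]))  # -> 50.0
```

Under this protocol the metric is binary per example, so small localization errors that stay within the target element do not hurt the score, while near misses just outside the box count as failures.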