Natural Language Visual Grounding
Metric: Accuracy (%)

Results

Accuracy of various models on this benchmark. Where the same paper appears more than once, the rows presumably correspond to different model variants or evaluation settings reported in that paper.

Comparison Table
| Model Name | Accuracy (%) |
|---|---|
| UGround (Navigating the Digital World as Humans Do) | 86.34 |
| Aguvis | 83.0 |
| OS-Atlas | 82.47 |
| Aria-UI | 81.1 |
| Aguvis | 81.0 |
| UGround (Navigating the Digital World as Humans Do) | 77.67 |
| ShowUI | 75.1 |
| ShowUI | 75.0 |
| UGround (Navigating the Digital World as Humans Do) | 73.3 |
| OmniParser | 73.0 |
| OS-Atlas | 68.0 |
| SeeClick | 53.4 |
| CogAgent | 47.4 |
| Qwen2-VL | 42.1 |
| GUICourse | 28.6 |
| MiniGPT-v2 | 5.7 |
| Groma | 5.2 |
| Qwen-VL | 5.2 |
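
For reference, GUI grounding benchmarks of this kind typically score a prediction as correct when the model's predicted click point falls inside the ground-truth bounding box of the target element, and report accuracy as the percentage of correct predictions. The snippet below is a minimal sketch of that scoring protocol, not taken from any specific benchmark's evaluation code; the `Example` container and its field names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Example:
    # Ground-truth bounding box of the target UI element, in pixels:
    # (left, top, right, bottom). Field layout is an illustrative assumption.
    bbox: tuple[float, float, float, float]
    # Model-predicted click point (x, y) for the natural-language instruction.
    pred: tuple[float, float]


def is_hit(example: Example) -> bool:
    """A prediction counts as correct if the predicted point
    falls inside the ground-truth element's bounding box."""
    x, y = example.pred
    left, top, right, bottom = example.bbox
    return left <= x <= right and top <= y <= bottom


def accuracy(examples: Iterable[Example]) -> float:
    """Percentage of examples where the predicted point hits the target box."""
    examples = list(examples)
    hits = sum(is_hit(ex) for ex in examples)
    return 100.0 * hits / len(examples)


# Toy usage: one prediction inside the target box, one outside.
print(accuracy([
    Example(bbox=(10, 10, 50, 30), pred=(25, 20)),    # hit
    Example(bbox=(100, 40, 160, 80), pred=(90, 60)),  # miss
]))  # -> 50.0
```

Under this protocol the metric is binary per example, so small localization errors that stay within the target element do not hurt the score, while near misses just outside the box count as failures.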