Visual Grounding on RefCOCO Test B
Metrics
Accuracy (%)
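
Accuracy here follows the standard visual grounding criterion: a prediction is commonly counted as correct when the predicted bounding box overlaps the ground-truth box with an IoU of at least 0.5. The snippet below is a minimal illustrative sketch of that computation (hypothetical helper names, not the official evaluation code of any listed model).

```python
# Minimal sketch of grounding accuracy under the common IoU >= 0.5 criterion.
# Boxes are (x1, y1, x2, y2); helper names are illustrative only.

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(predictions, ground_truths, threshold=0.5):
    """Percentage of referring expressions whose predicted box reaches the IoU threshold."""
    correct = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return 100.0 * correct / len(ground_truths)

# Example: one correct and one incorrect prediction -> 50.0% accuracy.
preds = [(10, 10, 50, 50), (0, 0, 20, 20)]
gts = [(12, 12, 52, 52), (100, 100, 150, 150)]
print(grounding_accuracy(preds, gts))
```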
Results
Performance of various models on this benchmark
| Model Name | Accuracy (%) | Paper Title | Repository |
|---|---|---|---|
| XFM (base) | 79.8 | Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks | |
| X2-VLM (base) | 78.4 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | |
| mPLUG-2 | 86.05 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | |
| X-VLM (base) | 76.91 | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | |
| Florence-2-large-ft | 92.0 | Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks | |
| X2-VLM (large) | 81.8 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | |