Visual Grounding on RefCOCO val
Metrics
Accuracy (%)
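The page does not define how accuracy is computed. A minimal sketch, assuming the convention commonly used for referring-expression grounding: a predicted box counts as correct when its intersection-over-union (IoU) with the ground-truth box reaches 0.5. The function names, the `(x1, y1, x2, y2)` box format, and the inclusive threshold are assumptions for illustration and are not taken from any of the listed papers.

```python
from typing import Sequence

Box = Sequence[float]  # assumed format: (x1, y1, x2, y2), with x2 >= x1 and y2 >= y1


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(preds: Sequence[Box], gts: Sequence[Box],
                       threshold: float = 0.5) -> float:
    """Percentage of predictions whose IoU with the ground truth is >= threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)
```

Some evaluation scripts use a strict `>` comparison instead of `>=`; check the protocol of the specific paper before comparing numbers.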
Results
Performance of different models on this benchmark
Model Name | Accuracy (%) | Paper Title | Repository |
---|---|---|---|
X-VLM (base) | 84.51 | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | |
X2-VLM (base) | 85.2 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | |
XFM (base) | 86.1 | Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks | |
mPLUG-2 | 90.33 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | |
X2-VLM (large) | 87.6 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | |
Florence-2-large-ft | 93.4 | Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks | |