Visual Grounding on RefCOCO testB
Evaluation Metric
Accuracy (%)
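
The page does not define how accuracy is computed. As a minimal sketch, assuming the convention commonly used for RefCOCO-style grounding (a predicted box counts as correct when its intersection-over-union with the ground-truth box is at least 0.5), the metric could be computed as follows. The function names, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are illustrative assumptions, not taken from this leaderboard or from any listed paper.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes yield no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box],
    targets: Sequence[Box],
    threshold: float = 0.5,  # assumed threshold, not stated on this page
) -> float:
    """Percentage of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)
```

Under these assumptions, each referring expression contributes one predicted box and one ground-truth box, and the reported figure is the share of expressions that pass the threshold, expressed as a percentage.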
Evaluation Results
Performance of each model on this benchmark
Model Name | Accuracy (%) | Paper Title | Repository |
---|---|---|---|
XFM (base) | 79.8 | Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks | |
X2-VLM (base) | 78.4 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | |
mPLUG-2 | 86.05 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | |
X-VLM (base) | 76.91 | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | |
Florence-2-large-ft | 92.0 | Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks | |
X2-VLM (large) | 81.8 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | |