
Phrase Grounding On Flickr30K Entities Test

Evaluation Metric

R@1
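
This page does not define R@1. A common convention for phrase grounding is to count a phrase as correctly localized when the top-ranked predicted box overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.5, and to report the percentage of phrases that are correct. The sketch below assumes that convention; the function names, the `(x1, y1, x2, y2)` box layout, and the 0.5 threshold are illustrative assumptions, not taken from this benchmark's evaluation code.

```python
from typing import Sequence, Tuple

# Assumed box layout: (x1, y1, x2, y2) with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(
    top1_predictions: Sequence[Box],
    ground_truths: Sequence[Box],
    threshold: float = 0.5,  # assumed IoU threshold
) -> float:
    """Percentage of phrases whose top-1 box reaches the IoU threshold."""
    if len(top1_predictions) != len(ground_truths):
        raise ValueError("one prediction is required per ground-truth box")
    if not top1_predictions:
        return 0.0
    hits = sum(
        iou(pred, gt) >= threshold
        for pred, gt in zip(top1_predictions, ground_truths)
    )
    return 100.0 * hits / len(top1_predictions)
```

Under these assumptions, a score such as 87.7 would mean that the top-ranked box met the overlap threshold for 87.7% of the test phrases.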

Evaluation Results

Performance of each model on this benchmark

| Model | R@1 | Paper Title | Repository |
|---|---|---|---|
| MCB | 48.69 | Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding | |
| DIGN | 78.73 | Disentangled Motif-aware Graph Learning for Phrase Grounding | - |
| VisualBERT | 71.33 | VisualBERT: A Simple and Performant Baseline for Vision and Language | |
| LCMCG | 76.74 | Learning Cross-modal Context Graph for Visual Grounding | |
| Soft-Label Chain CRF (SL-CCRF) | 74.69 | Phrase Grounding by Soft-Label Chain Conditional Random Field | |
| SCRC | 27.8 | Natural Language Object Retrieval | |
| DDPN (ResNet-101) | 73.3 | Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding | |
| DSPE | 43.89 | Learning Deep Structure-Preserving Image-Text Embeddings | - |
| GLIPv2 | 87.7 | GLIPv2: Unifying Localization and Vision-Language Understanding | |
| CCA | 25.30 | Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models | |
| CCA - Fast RCNN | 41.77 | Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models | |
| MDETR-ENB5 | 84.3 | MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding | |
| GLIP | 87.1 | Grounded Language-Image Pre-training | |
| PEVL | 84.4 | PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models | |
| CCA - VGG19 | 30.83 | Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models | |
| BAN (Bottom-Up detector) | 69.69 | Bilinear Attention Networks | |
| FIBER-B | 87.4 | Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | |
| GroundeR 100.0% annot. | 48.38 | Grounding of Textual Phrases in Images by Reconstruction | |