HyperAI超神経

Visual Question Answering on VQA v2 test-std

Evaluation Metric

overall
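The "overall" column below is the standard VQA accuracy over all question types. In the official VQA v2 protocol each question has 10 human answers, and a predicted answer counts as fully correct if at least 3 annotators gave it. A minimal sketch of the commonly used simplified form of this metric (the official evaluation additionally averages over annotator subsets and normalizes answer strings; function name here is illustrative):

```python
def vqa_accuracy(prediction: str, annotator_answers: list[str]) -> float:
    """Simplified VQA accuracy: a prediction scores min(n/3, 1),
    where n is the number of annotators who gave that answer."""
    matches = sum(answer == prediction for answer in annotator_answers)
    return min(matches / 3.0, 1.0)


# A prediction matching 3+ of the 10 annotators scores 1.0;
# fewer matches yield partial credit.
print(vqa_accuracy("yes", ["yes"] * 7 + ["no"] * 3))   # 1.0
print(vqa_accuracy("2", ["2", "2"] + ["3"] * 8))       # ~0.667
```

The "overall" score on the leaderboard is this per-question accuracy averaged over the full test-std set.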

Evaluation Results

Performance of each model on this benchmark.

| Model | overall | Paper Title |
| --- | --- | --- |
| LXMERT | 72.5 | LXMERT: Learning Cross-Modality Encoder Representations from Transformers |
| 2D continuous softmax | 66.27 | Sparse and Continuous Attention Mechanisms |
| VisualBERT | 71 | VisualBERT: A Simple and Performant Baseline for Vision and Language |
| X²-VLM (large) | 81.8 | X²-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| Image features from bottom-up attention (adaptive K, ensemble) | 70.3 | Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge |
| MCB [11, 12] | 62.27 | Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering |
| Up-Down | 70.34 | Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering |
| Prompt Tuning | 78.53 | Prompt Tuning for Generative Multimodal Pretrained Models |
| MCANed-6 | 70.9 | Deep Modular Co-Attention Networks for Visual Question Answering |
| BEiT-3 | 84.03 | Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks |
| VLMo | 81.30 | VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts |
| VALOR | 78.62 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
| BLOCK | 67.9 | BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection |
| mPLUG-Huge | 83.62 | mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections |
| DMN | 68.4 | Learning to Count Objects in Natural Images for Visual Question Answering |
| BGN, ensemble | 75.92 | Bilinear Graph Networks for Visual Question Answering |
| SimVLM | 80.34 | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision |
| VL-BERT (large) | 72.2 | VL-BERT: Pre-training of Generic Visual-Linguistic Representations |
| Single, w/o VLP | 74.16 | In Defense of Grid Features for Visual Question Answering |
| Single, w/o VLP | 73.86 | Deep Multimodal Neural Architecture Search |