HyperAI超神経

Visual Question Answering on VQA v2 test-dev

Evaluation Metric

Accuracy
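
Note that accuracy on VQA v2 is not plain exact-match: the official metric compares a predicted answer against ten human-annotated answers and treats a prediction as fully correct once at least three annotators gave it. Below is a minimal sketch of that consensus formula, assuming answers are already normalized the way the official evaluation script normalizes them (lowercasing, stripping punctuation and articles); the function name vqa_accuracy is illustrative, not part of any official API.

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """VQA-style consensus accuracy for a single question.

    The official VQA v2 metric averages min(#matches / 3, 1) over all
    ten leave-one-out subsets of the ten human answers; the common
    simplified form below scores against all ten answers at once.
    Assumes inputs are already normalized (lowercased, punctuation
    and articles stripped) as in the official evaluation script.
    """
    matches = sum(1 for ans in human_answers if ans == predicted)
    return min(matches / 3.0, 1.0)


# Example: 4 of 10 annotators answered "2", so "2" scores 1.0;
# "3" was given by a single annotator, so it scores 1/3.
answers = ["2", "2", "2", "2", "two", "two", "two", "3", "2 dogs", "2 dogs"]
print(vqa_accuracy("2", answers))  # 1.0
print(vqa_accuracy("3", answers))  # 0.333...
```

The min(·/3, 1) cap is why leaderboard numbers are reported as percentages that can fall between exact-match levels: partially agreed-upon answers earn fractional credit.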

Evaluation Results

Performance results of each model on this benchmark

Model Name | Accuracy | Paper Title
ONE-PEACE | 82.6 | ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Pythia v0.3 + LoRRA | 69.21 | Towards VQA Models That Can Read
mPLUG (Huge) | 82.43 | mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
X-VLM (base) | 78.22 | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
BEiT-3 | 84.19 | Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
Prismer | 78.43 | Prismer: A Vision-Language Model with Multi-Task Experts
CFR | 72.5 | Coarse-to-Fine Reasoning for Visual Question Answering
MUTAN | 67.42 | MUTAN: Multimodal Tucker Fusion for Visual Question Answering
Flamingo 80B | 56.3 | Flamingo: a Visual Language Model for Few-Shot Learning
Image features from bottom-up attention (adaptive K, ensemble) | 69.87 | Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge
MMU | 81.26 | Achieving Human Parity on Visual Question Answering
ALBEF (14M) | 75.84 | Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Oscar | 73.82 | Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
SimVLM | 80.03 | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
BLIP-2 ViT-G OPT 2.7B (zero-shot) | 52.3 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
VK-OOD | 77.9 | Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis
ViLT-B/32 | 71.26 | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
MCAN+VC | 71.21 | Visual Commonsense R-CNN
BLIP-2 ViT-L FlanT5 XL (zero-shot) | 62.3 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
BLIP-2 ViT-L OPT 2.7B (zero-shot) | 49.7 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models