HyperAI
Visual Question Answering on VQA v2 test-dev
Evaluation metric
Accuracy
Evaluation results
Performance of each model on this benchmark
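The accuracy reported on this benchmark is the standard VQA soft-accuracy score: a predicted answer gets credit proportional to how many of the 10 human annotators gave that answer, capped at 1.0 when at least 3 agree, and averaged over all leave-one-out subsets of the annotations. A minimal sketch of that metric (the function name and answer normalization are simplified here; the official evaluation also lowercases and strips punctuation before matching):

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Soft VQA accuracy for one question.

    For each leave-one-out subset of the human answers, score
    min(#matching answers / 3, 1.0), then average the scores.
    """
    scores = []
    for i in range(len(human_answers)):
        # Drop annotator i, count matches among the remaining answers.
        others = human_answers[:i] + human_answers[i + 1:]
        matches = sum(ans == predicted for ans in others)
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / len(scores)


# Full agreement: every subset has >= 3 matches, so the score is 1.0.
print(vqa_accuracy("cat", ["cat"] * 10))  # → 1.0
# Exactly 3 of 10 annotators said "cat": partial credit of 0.9.
print(vqa_accuracy("cat", ["cat"] * 3 + ["dog"] * 7))  # → 0.9
```

The leave-one-out averaging makes the metric robust to a single disagreeing annotator, which is why leaderboard scores are reported to two decimal places rather than as exact fractions.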
| Model | Accuracy | Paper Title |
| --- | --- | --- |
| PaLI | 84.3 | PaLI: A Jointly-Scaled Multilingual Language-Image Model |
| BEiT-3 | 84.19 | Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks |
| VLMo | 82.78 | VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts |
| ONE-PEACE | 82.6 | ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities |
| mPLUG (Huge) | 82.43 | mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections |
| CuMo-7B | 82.2 | CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts |
| X2-VLM (large) | 81.9 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| MMU | 81.26 | Achieving Human Parity on Visual Question Answering |
| Lyrics | 81.2 | Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects |
| InternVL-C | 81.2 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks |
| X2-VLM (base) | 80.4 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| XFM (base) | 80.4 | Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks |
| VAST | 80.23 | - |
| SimVLM | 80.03 | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision |
| VALOR | 78.46 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
| Prismer | 78.43 | Prismer: A Vision-Language Model with Multi-Task Experts |
| X-VLM (base) | 78.22 | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts |
| VK-OOD | 77.9 | Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis |
| ALBEF (14M) | 75.84 | Align before Fuse: Vision and Language Representation Learning with Momentum Distillation |
| Oscar | 73.82 | Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks |

(First 20 of 56 leaderboard entries shown.)