Visual Reasoning on NLVR2 Dev

Evaluation metric: Accuracy

Evaluation results: performance of each model on this benchmark
| Model Name | Accuracy (%) | Paper Title | Repository |
|---|---|---|---|
| XFM (base) | 87.6 | Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks | - |
| VLMo | 85.64 | VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | - |
| VisualBERT | 66.7 | VisualBERT: A Simple and Performant Baseline for Vision and Language | - |
| VK-OOD | 83.9 | Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | - |
| CoCa | 86.1 | CoCa: Contrastive Captioners are Image-Text Foundation Models | - |
| X-VLM (base) | 84.41 | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | - |
| X²-VLM (large) | 88.7 | X²-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | - |
| SOHO | 76.37 | Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | - |
| ALBEF (14M) | 83.14 | Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | - |
| VK-OOD | 84.6 | Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | - |
| ViLT-B/32 | 75.7 | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | - |
| SimVLM | 84.53 | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | - |
| BEiT-3 | 91.51 | Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | - |
| X²-VLM (base) | 86.2 | X²-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | - |
| LXMERT (Pre-train + scratch) | 74.9 | LXMERT: Learning Cross-Modality Encoder Representations from Transformers | - |
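For reference, each NLVR2 example pairs a natural-language statement with two photographs, and a model must predict whether the statement is true of the image pair; the accuracy figures above are simply the percentage of dev-split examples classified correctly. Below is a minimal sketch of that computation, assuming the dev split in the official NLVR2 JSONL format (one example per line, with `identifier` and a string-valued `label` field, as in the lil-lab/nlvr release) and a hypothetical `predictions` dict mapping example identifiers to boolean model outputs.

```python
import json

def nlvr2_accuracy(dev_path: str, predictions: dict) -> float:
    """Percentage of NLVR2 dev examples whose predicted True/False
    label matches the gold label (the number reported above)."""
    correct = total = 0
    with open(dev_path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)            # one example per JSONL line
            gold = ex["label"] == "True"     # gold label stored as a string
            pred = predictions[ex["identifier"]]  # model's boolean prediction
            correct += int(pred == gold)
            total += 1
    return 100.0 * correct / total
```

The string comparison against "True" reflects how labels are stored in the released data files; a model scoring at BEiT-3's level would return a value near 91.5 from this function.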