Visual Reasoning on NLVR2 Dev

Evaluation metric: Accuracy

Evaluation results: performance of each model on this benchmark
| Model Name | Accuracy (%) | Paper Title | Repository |
|---|---|---|---|
| XFM (base) | 87.6 | Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks | - |
| VLMo | 85.64 | VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | - |
| VisualBERT | 66.7 | VisualBERT: A Simple and Performant Baseline for Vision and Language | - |
| VK-OOD | 83.9 | Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | - |
| CoCa | 86.1 | CoCa: Contrastive Captioners are Image-Text Foundation Models | - |
| X-VLM (base) | 84.41 | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | - |
| X²-VLM (large) | 88.7 | X²-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | - |
| SOHO | 76.37 | Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | - |
| ALBEF (14M) | 83.14 | Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | - |
| VK-OOD | 84.6 | Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | - |
| ViLT-B/32 | 75.7 | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | - |
| SimVLM | 84.53 | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | - |
| BEiT-3 | 91.51 | Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | - |
| X²-VLM (base) | 86.2 | X²-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | - |
| LXMERT (Pre-train + scratch) | 74.9 | LXMERT: Learning Cross-Modality Encoder Representations from Transformers | - |
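For reference, each NLVR2 example pairs a natural-language statement with two photographs, and a model must predict whether the statement is true of the image pair; the accuracy figures above are simply the percentage of dev-split examples classified correctly. Below is a minimal sketch of that computation, assuming the dev split in the official NLVR2 JSONL format (one example per line, with `identifier` and a string-valued `label` field, as in the lil-lab/nlvr release) and a hypothetical `predictions` dict mapping example identifiers to boolean model outputs.

```python
import json

def nlvr2_accuracy(dev_path: str, predictions: dict) -> float:
    """Percentage of NLVR2 dev examples whose predicted True/False
    label matches the gold label (the number reported above)."""
    correct = total = 0
    with open(dev_path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)            # one example per JSONL line
            gold = ex["label"] == "True"     # gold label stored as a string
            pred = predictions[ex["identifier"]]  # model's boolean prediction
            correct += int(pred == gold)
            total += 1
    return 100.0 * correct / total
```

The string comparison against "True" reflects how labels are stored in the released data files; a model scoring at BEiT-3's level would return a value near 91.5 from this function.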