Visual Question Answering on VQA v2 test-dev
Evaluation metric: Accuracy
Evaluation results: the performance of each model on this benchmark.
| Model Name | Accuracy (%) | Paper Title |
| --- | --- | --- |
| BEiT-3 | 84.19 | Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks |
| ONE-PEACE | 82.6 | ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities |
| mPLUG (Huge) | 82.43 | mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections |
| MMU | 81.26 | Achieving Human Parity on Visual Question Answering |
| SimVLM | 80.03 | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision |
| Prismer | 78.43 | Prismer: A Vision-Language Model with Multi-Task Experts |
| X-VLM (base) | 78.22 | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts |
| VK-OOD | 77.9 | Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis |
| ALBEF (14M) | 75.84 | Align before Fuse: Vision and Language Representation Learning with Momentum Distillation |
| Oscar | 73.82 | Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks |
| CFR | 72.5 | Coarse-to-Fine Reasoning for Visual Question Answering |
| ViLT-B/32 | 71.26 | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision |
| MCAN+VC | 71.21 | Visual Commonsense R-CNN |
| Image features from bottom-up attention (adaptive K, ensemble) | 69.87 | Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge |
| Pythia v0.3 + LoRRA | 69.21 | Towards VQA Models That Can Read |
| MUTAN | 67.42 | MUTAN: Multimodal Tucker Fusion for Visual Question Answering |
| BLIP-2 ViT-L FlanT5 XL (zero-shot) | 62.3 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| Flamingo 80B | 56.3 | Flamingo: a Visual Language Model for Few-Shot Learning |
| BLIP-2 ViT-G OPT 2.7B (zero-shot) | 52.3 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| BLIP-2 ViT-L OPT 2.7B (zero-shot) | 49.7 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
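The Accuracy column reports VQA accuracy as a percentage. On VQA v2 each question comes with ten human-provided answers, and the commonly used scoring rule credits a prediction with min(matches / 3, 1), so an answer given by at least three annotators counts as fully correct. The sketch below illustrates that rule; it is a simplified, hypothetical helper, not the official evaluator, which additionally normalizes answers (lowercasing, removing articles and punctuation) and averages the score over leave-one-annotator-out subsets.

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy for one question: min(matches / 3, 1).

    Assumes answers are already normalized (the official evaluation
    also lowercases and strips articles/punctuation before matching).
    """
    matches = sum(ans == predicted for ans in human_answers)
    return min(matches / 3.0, 1.0)


def benchmark_accuracy(predictions: list[str],
                       answer_sets: list[list[str]]) -> float:
    """Mean VQA accuracy over all questions, reported as a percentage."""
    scores = [vqa_accuracy(p, a) for p, a in zip(predictions, answer_sets)]
    return 100.0 * sum(scores) / len(scores)
```

For example, if a model answers "red" and four of the ten annotators also wrote "red", that question scores 1.0; if only two annotators wrote "red", it scores roughly 0.67.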