Visual Question Answering on VQA v2 test-std
Evaluation metric
overall
Evaluation results
Performance results for each model on this benchmark
| Model name | overall | Paper Title | Repository |
|---|---|---|---|
| LXMERT | 72.5 | LXMERT: Learning Cross-Modality Encoder Representations from Transformers | |
| 2D continuous softmax | 66.27 | Sparse and Continuous Attention Mechanisms | |
| VisualBERT | 71 | VisualBERT: A Simple and Performant Baseline for Vision and Language | |
| X2-VLM (large) | 81.8 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | |
| Image features from bottom-up attention (adaptive K, ensemble) | 70.3 | Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge | |
| MCB [11, 12] | 62.27 | Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering | |
| Up-Down | 70.34 | Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering | |
| Prompt Tuning | 78.53 | Prompt Tuning for Generative Multimodal Pretrained Models | |
| MCANed-6 | 70.9 | Deep Modular Co-Attention Networks for Visual Question Answering | |
| BEiT-3 | 84.03 | Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | |
| VLMo | 81.30 | VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | |
| VALOR | 78.62 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | |
| BLOCK | 67.9 | BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection | |
| mPLUG-Huge | 83.62 | mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | |
| DMN | 68.4 | Learning to Count Objects in Natural Images for Visual Question Answering | |
| BGN, ensemble | 75.92 | Bilinear Graph Networks for Visual Question Answering | - |
| SimVLM | 80.34 | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | |
| VL-BERTLARGE | 72.2 | VL-BERT: Pre-training of Generic Visual-Linguistic Representations | |
| Single, w/o VLP | 74.16 | In Defense of Grid Features for Visual Question Answering | |
| Single, w/o VLP | 73.86 | Deep Multimodal Neural Architecture Search | |