HyperAI
HyperAI초신경
홈
뉴스
최신 연구 논문
튜토리얼
데이터셋
백과사전
SOTA
LLM 모델
GPU 랭킹
컨퍼런스
전체 검색
소개
한국어
HyperAI
HyperAI초신경
Toggle sidebar
전체 사이트 검색...
⌘
K
홈
SOTA
시각적 질문 응답 (VQA)
Visual Question Answering On Gqa Test Dev
Visual Question Answering On Gqa Test Dev
평가 지표
Accuracy
평가 결과
이 벤치마크에서 각 모델의 성능 결과
Columns
모델 이름
Accuracy
Paper Title
Repository
BLIP-2 ViT-G OPT 2.7B (zero-shot)
34.6
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
BLIP-2 ViT-G FlanT5 XXL (zero-shot)
44.7
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
PNP-VQA
41.9
Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training
PaLI-X-VPD
67.3
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
-
LXMERT (Pre-train + scratch)
60.0
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
BLIP-2 ViT-L FlanT5 XL (zero-shot)
44.4
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
FewVLM (zero-shot)
29.3
A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models
BLIP-2 ViT-G OPT 6.7B (zero-shot)
36.4
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
HYDRA
47.9
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning
NSM
62.95
Learning by Abstraction: The Neural State Machine
Lyrics
62.4
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects
-
BLIP-2 ViT-L OPT 2.7B (zero-shot)
33.9
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
single-hop + LCGN (ours)
55.8
Language-Conditioned Graph Networks for Relational Reasoning
CFR
72.1
Coarse-to-Fine Reasoning for Visual Question Answering
BLIP-2 ViT-G FlanT5 XL (zero-shot)
44.2
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Video-LaVIT
64.4
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
CuMo-7B
64.9
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
0 of 17 row(s) selected.
Previous
Next
Visual Question Answering On Gqa Test Dev | SOTA | HyperAI초신경