Visual Question Answering on GQA Test-Dev
Metrics
Accuracy
Results
Performance results of various models on this benchmark
| Model Name | Accuracy | Paper Title |
| --- | --- | --- |
| CFR | 72.1 | Coarse-to-Fine Reasoning for Visual Question Answering |
| PaLI-X-VPD | 67.3 | Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models |
| CuMo-7B | 64.9 | CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts |
| Video-LaVIT | 64.4 | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization |
| NSM | 62.95 | Learning by Abstraction: The Neural State Machine |
| Lyrics | 62.4 | Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects |
| LXMERT (Pre-train + scratch) | 60.0 | LXMERT: Learning Cross-Modality Encoder Representations from Transformers |
| single-hop + LCGN (ours) | 55.8 | Language-Conditioned Graph Networks for Relational Reasoning |
| HYDRA | 47.9 | HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning |
| BLIP-2 ViT-G FlanT5 XXL (zero-shot) | 44.7 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| BLIP-2 ViT-L FlanT5 XL (zero-shot) | 44.4 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| BLIP-2 ViT-G FlanT5 XL (zero-shot) | 44.2 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| PNP-VQA | 41.9 | Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training |
| BLIP-2 ViT-G OPT 6.7B (zero-shot) | 36.4 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| BLIP-2 ViT-G OPT 2.7B (zero-shot) | 34.6 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| BLIP-2 ViT-L OPT 2.7B (zero-shot) | 33.9 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| FewVLM (zero-shot) | 29.3 | A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models |