Visual Question Answering on GQA Test-Dev
Evaluation metric
Accuracy
Evaluation results
Performance of each model on this benchmark
| Model name | Accuracy (%) | Paper title |
| --- | --- | --- |
| CFR | 72.1 | Coarse-to-Fine Reasoning for Visual Question Answering |
| PaLI-X-VPD | 67.3 | Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models |
| CuMo-7B | 64.9 | CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts |
| Video-LaVIT | 64.4 | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization |
| NSM | 62.95 | Learning by Abstraction: The Neural State Machine |
| Lyrics | 62.4 | Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects |
| LXMERT (Pre-train + scratch) | 60.0 | LXMERT: Learning Cross-Modality Encoder Representations from Transformers |
| single-hop + LCGN (ours) | 55.8 | Language-Conditioned Graph Networks for Relational Reasoning |
| HYDRA | 47.9 | HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning |
| BLIP-2 ViT-G FlanT5 XXL (zero-shot) | 44.7 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| BLIP-2 ViT-L FlanT5 XL (zero-shot) | 44.4 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| BLIP-2 ViT-G FlanT5 XL (zero-shot) | 44.2 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| PNP-VQA | 41.9 | Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training |
| BLIP-2 ViT-G OPT 6.7B (zero-shot) | 36.4 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| BLIP-2 ViT-G OPT 2.7B (zero-shot) | 34.6 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| BLIP-2 ViT-L OPT 2.7B (zero-shot) | 33.9 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| FewVLM (zero-shot) | 29.3 | A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models |