HyperAI
HyperAI超神経
Visual Question Answering on GQA Test-Dev
Evaluation metric
Accuracy
Evaluation results
Performance of each model on this benchmark
| Model | Accuracy | Paper Title | Repository |
| --- | --- | --- | --- |
| BLIP-2 ViT-G OPT 2.7B (zero-shot) | 34.6 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| BLIP-2 ViT-G FlanT5 XXL (zero-shot) | 44.7 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| PNP-VQA | 41.9 | Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training | |
| PaLI-X-VPD | 67.3 | Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models | - |
| LXMERT (Pre-train + scratch) | 60.0 | LXMERT: Learning Cross-Modality Encoder Representations from Transformers | |
| BLIP-2 ViT-L FlanT5 XL (zero-shot) | 44.4 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| FewVLM (zero-shot) | 29.3 | A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models | |
| BLIP-2 ViT-G OPT 6.7B (zero-shot) | 36.4 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| HYDRA | 47.9 | HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning | |
| NSM | 62.95 | Learning by Abstraction: The Neural State Machine | |
| Lyrics | 62.4 | Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects | - |
| BLIP-2 ViT-L OPT 2.7B (zero-shot) | 33.9 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| single-hop + LCGN (ours) | 55.8 | Language-Conditioned Graph Networks for Relational Reasoning | |
| CFR | 72.1 | Coarse-to-Fine Reasoning for Visual Question Answering | |
| BLIP-2 ViT-G FlanT5 XL (zero-shot) | 44.2 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| Video-LaVIT | 64.4 | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | |
| CuMo-7B | 64.9 | CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | |