Visual Question Answering on OK-VQA
Evaluation Metric: Accuracy
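For reference, OK-VQA is typically scored with the standard VQA-style soft accuracy, where a predicted answer earns partial credit according to how many human annotators gave the same answer. The sketch below illustrates this simplified, commonly cited form (the official evaluator also normalizes answers and averages over annotator subsets); the function and variable names here are illustrative, not from an official codebase.

```python
# Minimal sketch of VQA-style soft accuracy as commonly reported for OK-VQA.
# Assumes answers are already normalized (lower-cased, articles/punctuation
# stripped) as the official evaluation script would do.

from collections import Counter
from typing import List


def vqa_soft_accuracy(prediction: str, human_answers: List[str]) -> float:
    """Credit a prediction by annotator agreement: min(#matches / 3, 1)."""
    counts = Counter(human_answers)
    return min(counts[prediction] / 3.0, 1.0)


def dataset_accuracy(predictions: List[str], references: List[List[str]]) -> float:
    """Average per-question soft accuracy, reported as a percentage."""
    scores = [vqa_soft_accuracy(p, refs) for p, refs in zip(predictions, references)]
    return 100.0 * sum(scores) / len(scores)


# Example: 2 of 5 annotators agree with the prediction -> credit of 2/3.
print(vqa_soft_accuracy("umbrella", ["umbrella", "umbrella", "parasol", "canopy", "rain cover"]))
```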
Evaluation Results: performance of each model on this benchmark.
| Model Name | Accuracy | Paper Title | Repository |
| --- | --- | --- | --- |
| PaLM-E-562B | 66.1 | PaLM-E: An Embodied Multimodal Language Model | |
| PICa | 48.0 | An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | |
| MetaLM | 11.4 | Language Models are General-Purpose Interfaces | |
| REVIVE (Ensemble) | 58.0 | REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering | |
| A Simple Baseline for KB-VQA | 61.2 | A Simple Baseline for Knowledge-Based Visual Question Answering | - |
| Prophet | 62.5 | Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering | |
| PNP-VQA | 35.9 | Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training | |
| RA-VQA-FrDPR (T5-large) | 51.22 | Retrieval Augmented Visual Question Answering with Outside Knowledge | |
| VLC-BERT | 43.1 | VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge | |
| BLIP-2 ViT-L FlanT5 XL (zero-shot) | 39.4 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| Frozen | 5.9 | Multimodal Few-Shot Learning with Frozen Language Models | - |
| T5 (Tan and Bansal, 2019) + Prefixes | 42.03 | LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | |
| VK-OOD | 52.4 | Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | |
| FewVLM | 16.5 | A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models | |
| BLIP-2 ViT-G FlanT5 XXL (zero-shot) | 45.9 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| LaKo | 47.01 | LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | |
| VLKD (ViT-B/16) | 10.5 | Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation | - |
| RA-VQA-v2 (BLIP 2) | 62.08 | Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering | - |
| BLIP-2 ViT-G OPT 2.7B (zero-shot) | 31.7 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| Flamingo3B | 41.2 | Flamingo: a Visual Language Model for Few-Shot Learning | |