HyperAI超神経

Visual Question Answering on OK-VQA

Evaluation Metrics

Accuracy

Evaluation Results

Performance results of each model on this benchmark

| Model | Accuracy | Paper Title | Repository |
|---|---|---|---|
| PaLM-E-562B | 66.1 | PaLM-E: An Embodied Multimodal Language Model | |
| PICa | 48.0 | An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | |
| MetaLM | 11.4 | Language Models are General-Purpose Interfaces | |
| REVIVE (Ensemble) | 58.0 | REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering | |
| A Simple Baseline for KB-VQA | 61.2 | A Simple Baseline for Knowledge-Based Visual Question Answering | - |
| Prophet | 62.5 | Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering | |
| PNP-VQA | 35.9 | Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training | |
| RA-VQA-FrDPR (T5-large) | 51.22 | Retrieval Augmented Visual Question Answering with Outside Knowledge | |
| VLC-BERT | 43.1 | VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge | |
| BLIP-2 ViT-L FlanT5 XL (zero-shot) | 39.4 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| Frozen | 5.9 | Multimodal Few-Shot Learning with Frozen Language Models | - |
| T5 (Tan and Bansal, 2019) + Prefixes | 42.03 | LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | |
| VK-OOD | 52.4 | Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | |
| FewVLM | 16.5 | A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models | |
| BLIP-2 ViT-G FlanT5 XXL (zero-shot) | 45.9 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| LaKo | 47.01 | LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | |
| VLKD (ViT-B/16) | 10.5 | Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation | - |
| RA-VQA-v2 (BLIP 2) | 62.08 | Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering | - |
| BLIP-2 ViT-G OPT 2.7B (zero-shot) | 31.7 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| Flamingo3B | 41.2 | Flamingo: a Visual Language Model for Few-Shot Learning | |