HyperAI
Visual Question Answering on OK-VQA
Metrics
Accuracy
Results
Performance results of various models on this benchmark
| Model | Accuracy | Paper Title | Repository |
| --- | --- | --- | --- |
| PaLM-E-562B | 66.1 | PaLM-E: An Embodied Multimodal Language Model | |
| PICa | 48.0 | An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | |
| MetaLM | 11.4 | Language Models are General-Purpose Interfaces | |
| REVIVE (Ensemble) | 58.0 | REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering | |
| A Simple Baseline for KB-VQA | 61.2 | A Simple Baseline for Knowledge-Based Visual Question Answering | - |
| Prophet | 62.5 | Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering | |
| PNP-VQA | 35.9 | Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training | |
| RA-VQA-FrDPR (T5-large) | 51.22 | Retrieval Augmented Visual Question Answering with Outside Knowledge | |
| VLC-BERT | 43.1 | VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge | |
| BLIP-2 ViT-L FlanT5 XL (zero-shot) | 39.4 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| Frozen | 5.9 | Multimodal Few-Shot Learning with Frozen Language Models | - |
| T5(Tan and Bansal, 2019) + Prefixes | 42.03 | LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | |
| VK-OOD | 52.4 | Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | |
| FewVLM | 16.5 | A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models | |
| BLIP-2 ViT-G FlanT5 XXL (zero-shot) | 45.9 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| LaKo | 47.01 | LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | |
| VLKD(ViT-B/16) | 10.5 | Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation | - |
| RA-VQA-v2 (BLIP 2) | 62.08 | Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering | - |
| BLIP-2 ViT-G OPT 2.7B (zero-shot) | 31.7 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| Flamingo3B | 41.2 | Flamingo: a Visual Language Model for Few-Shot Learning | |
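The results above are not listed in score order. A minimal sketch for ranking them by accuracy follows; the model names and scores are copied from the listing, while `RESULTS` and `rank_by_accuracy` are hypothetical names introduced only for this illustration and are not part of the site.

```python
# Model names and accuracy values copied from the results listed above.
RESULTS = [
    ("PaLM-E-562B", 66.1),
    ("PICa", 48.0),
    ("MetaLM", 11.4),
    ("REVIVE (Ensemble)", 58.0),
    ("A Simple Baseline for KB-VQA", 61.2),
    ("Prophet", 62.5),
    ("PNP-VQA", 35.9),
    ("RA-VQA-FrDPR (T5-large)", 51.22),
    ("VLC-BERT", 43.1),
    ("BLIP-2 ViT-L FlanT5 XL (zero-shot)", 39.4),
    ("Frozen", 5.9),
    ("T5(Tan and Bansal, 2019) + Prefixes", 42.03),
    ("VK-OOD", 52.4),
    ("FewVLM", 16.5),
    ("BLIP-2 ViT-G FlanT5 XXL (zero-shot)", 45.9),
    ("LaKo", 47.01),
    ("VLKD(ViT-B/16)", 10.5),
    ("RA-VQA-v2 (BLIP 2)", 62.08),
    ("BLIP-2 ViT-G OPT 2.7B (zero-shot)", 31.7),
    ("Flamingo3B", 41.2),
]


def rank_by_accuracy(results):
    """Return (rank, model, accuracy) tuples, highest accuracy first."""
    ordered = sorted(results, key=lambda row: row[1], reverse=True)
    return [(rank, name, acc) for rank, (name, acc) in enumerate(ordered, start=1)]


if __name__ == "__main__":
    for rank, name, acc in rank_by_accuracy(RESULTS):
        print(f"{rank:>2}. {name:<40} {acc:6.2f}")
```

Note that the entries mix zero-shot, few-shot, and fine-tuned settings, so a single ranking by accuracy does not compare like with like.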