HyperAI
Visual Question Answering (VQA)
Visual Question Answering on OK-VQA
Metrics: Accuracy

Results
Performance results of various models on this benchmark.
| Model | Accuracy | Paper Title | Repository |
|---|---|---|---|
| PaLM-E-562B | 66.1 | PaLM-E: An Embodied Multimodal Language Model | |
| PICa | 48.0 | An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | |
| MetaLM | 11.4 | Language Models are General-Purpose Interfaces | |
| REVIVE (Ensemble) | 58.0 | REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering | |
| A Simple Baseline for KB-VQA | 61.2 | A Simple Baseline for Knowledge-Based Visual Question Answering | - |
| Prophet | 62.5 | Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering | |
| PNP-VQA | 35.9 | Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training | |
| RA-VQA-FrDPR (T5-large) | 51.22 | Retrieval Augmented Visual Question Answering with Outside Knowledge | |
| VLC-BERT | 43.1 | VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge | |
| BLIP-2 ViT-L FlanT5 XL (zero-shot) | 39.4 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| Frozen | 5.9 | Multimodal Few-Shot Learning with Frozen Language Models | - |
| T5 (Tan and Bansal, 2019) + Prefixes | 42.03 | LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | |
| VK-OOD | 52.4 | Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | |
| FewVLM | 16.5 | A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models | |
| BLIP-2 ViT-G FlanT5 XXL (zero-shot) | 45.9 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| LaKo | 47.01 | LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | |
| VLKD (ViT-B/16) | 10.5 | Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation | - |
| RA-VQA-v2 (BLIP 2) | 62.08 | Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering | - |
| BLIP-2 ViT-G OPT 2.7B (zero-shot) | 31.7 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| Flamingo3B | 41.2 | Flamingo: a Visual Language Model for Few-Shot Learning | |
(20 of the benchmark's 37 entries are shown above.)