Visual Question Answering on MSRVTT-QA
Evaluation metric
Accuracy
Evaluation results
Performance of each model on this benchmark
| Model | Accuracy | Paper Title | Repository |
| --- | --- | --- | --- |
| vid-TLDR (UMT-L) | 0.470 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer | |
| UMT-L (ViT-L/16) | 0.471 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | |
| CLIPBERT | 0.374 | Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling | |
| All-in-one+ | 0.395 | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | |
| Flamingo (32-shot) | 0.310 | Flamingo: a Visual Language Model for Few-Shot Learning | |
| FrozenBiLM+ | 0.470 | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | |
| All-in-one-B | 0.443 | All in One: Exploring Unified Video-Language Pre-training | |
| HBI | 0.462 | Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | |
| Co-Mem | 0.32 | Motion-Appearance Co-Memory Networks for Video Question Answering | - |
| VideoCoCa | 0.463 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | - |
| ST-VQA | 0.309 | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | |
| Clover | 0.441 | Clover: Towards A Unified Video-Language Alignment and Fusion Model | |
| ALPRO | 0.421 | Align and Prompt: Video-and-Language Pre-training with Entity Prompts | |
| Co-Tokenization | 0.457 | Video Question Answering with Iterative Video-Text Co-Tokenization | - |
| VLAB | 0.496 | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | - |
| HMEMA | 0.33 | Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering | |
| X2-VLM (base) | 0.45 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | |
| AIO+MDF | 0.438 | Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | |
| JustAsk+ | 0.418 | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | |
| LRCE | 0.42 | Lightweight Recurrent Cross-modal Encoder for Video Question Answering | |