Visual Question Answering on MSVD-QA
Evaluation Metric: Accuracy

Evaluation Results
Performance of each model on this benchmark.
| Model Name | Accuracy | Paper Title |
|------------|----------|-------------|
| VLAB | 0.61 | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending |
| MA-LMM | 0.606 | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding |
| MaMMUT | 0.602 | MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks |
| VAST | 0.60 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| COSA | 0.60 | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model |
| VALOR | 0.60 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
| mPLUG-2 | 0.581 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| VideoCoCa | 0.569 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners |
| GIT | 0.568 | GIT: A Generative Image-to-text Transformer for Vision and Language |
| FrozenBiLM+ | 0.558 | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models |
| HiTeA | 0.556 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| InternVideo | 0.555 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| UMT-L (ViT-L/16) | 0.552 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| vid-TLDR (UMT-L) | 0.549 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer |
| MuLTI | 0.547 | MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling |
| VIOLETv2 | 0.547 | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling |
| X2-VLM (large) | 0.546 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| X2-VLM (base) | 0.528 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| Clover | 0.524 | Clover: Towards A Unified Video-Language Alignment and Fusion Model |
| VIOLET + MELTR | 0.517 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models |
(Top 20 of 36 leaderboard entries shown.)
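For context, the Accuracy column reports the fraction of questions for which a model's predicted answer matches the ground-truth answer. The sketch below illustrates an exact-match accuracy computation of the kind commonly used for open-ended video QA benchmarks such as MSVD-QA; the answer normalization and function names are illustrative assumptions, not the benchmark's official scorer.

```python
# Minimal sketch of exact-match accuracy for open-ended video QA.
# The normalization rules below are illustrative assumptions, not
# MSVD-QA's official evaluation code.

def normalize(answer: str) -> str:
    """Lowercase the answer and strip surrounding whitespace and a trailing period."""
    return answer.strip().lower().rstrip(".")

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the reference answer."""
    assert len(predictions) == len(references)
    correct = sum(
        normalize(p) == normalize(r)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)

if __name__ == "__main__":
    preds = ["a dog", "Guitar", "two"]
    refs = ["dog", "guitar", "two"]
    print(f"accuracy = {accuracy(preds, refs):.3f}")  # 0.667: "a dog" != "dog"
```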