HyperAI
Visual Question Answering (VQA)
Visual Question Answering on MSRVTT-QA
Evaluation Metric: Accuracy

Evaluation Results: performance of each model on this benchmark.

| Model | Accuracy | Paper Title |
| --- | --- | --- |
| VLAB | 0.496 | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending |
| MaMMUT | 0.495 | MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks |
| mPLUG-2 | 0.480 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| MuLTI | 0.478 | MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling |
| Flamingo | 0.474 | Flamingo: a Visual Language Model for Few-Shot Learning |
| UMT-L (ViT-L/16) | 0.471 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| InternVideo | 0.471 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| vid-TLDR (UMT-L) | 0.470 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer |
| FrozenBiLM+ | 0.470 | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models |
| VideoCoCa | 0.463 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners |
| HBI | 0.462 | Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning |
| HiTeA | 0.459 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| EMCL-Net | 0.458 | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations |
| Co-Tokenization | 0.457 | Video Question Answering with Iterative Video-Text Co-Tokenization |
| X2-VLM (large) | 0.455 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| X2-VLM (base) | 0.450 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| All-in-one-B | 0.443 | All in One: Exploring Unified Video-Language Pre-training |
| Clover | 0.441 | Clover: Towards A Unified Video-Language Alignment and Fusion Model |
| OmniVL | 0.441 | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks |
| AIO+MIF | 0.440 | Self-Adaptive Sampling for Efficient Video Question-Answering on Image--Text Models |
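The Accuracy scores above reflect open-ended answer accuracy: each MSRVTT-QA question has a single ground-truth answer, and a prediction counts as correct only if it matches that answer. The sketch below illustrates this metric under the common exact-match convention; the `normalize` step (lowercasing and whitespace stripping) is an assumption for illustration, not the benchmark's official evaluation code.

```python
# Hedged sketch of exact-match VideoQA accuracy, as typically reported
# on MSRVTT-QA. Not the benchmark's official scorer.

def normalize(answer: str) -> str:
    """Lowercase and strip whitespace before comparison (assumed convention)."""
    return answer.strip().lower()

def vqa_accuracy(predictions: list[str], ground_truths: list[str]) -> float:
    """Fraction of questions whose predicted answer exactly matches the label."""
    correct = sum(
        normalize(pred) == normalize(gold)
        for pred, gold in zip(predictions, ground_truths)
    )
    return correct / len(ground_truths)

# Toy example: 3 of 4 answers match after normalization.
preds = ["dog", "Running", "blue", "two"]
golds = ["dog", "running", "red", "two"]
print(vqa_accuracy(preds, golds))  # -> 0.75
```

Under this convention, a reported score such as 0.496 means roughly half of the test questions were answered with the exact ground-truth string.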
The table above lists 20 of the 34 entries on this benchmark.