HyperAI초신경
Zero Shot Video Retrieval On Didemo
Evaluation Metrics

- text-to-video R@1
- text-to-video R@5
- text-to-video R@10
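Recall@K (R@K) is the fraction of text queries for which the ground-truth video appears among the top-K retrieved results. As a minimal sketch (the function name, the toy similarity values, and the convention that text i matches video i are illustrative assumptions, not taken from this leaderboard), R@1/R@5/R@10 can be computed from a text-to-video similarity matrix like so:

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """R@k for text-to-video retrieval: the fraction of text queries
    whose ground-truth video (assumed to share the query's index)
    appears among the top-k ranked videos."""
    # Rank video indices for each text query by descending similarity.
    ranks = np.argsort(-sim, axis=1)
    # For each query i, find the position of video i in its ranking.
    gt_rank = np.argmax(ranks == np.arange(sim.shape[0])[:, None], axis=1)
    return float(np.mean(gt_rank < k))

# Toy 3x3 similarity matrix (illustrative values only);
# entry [i, j] is the similarity of text query i to video j.
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.2, 0.8],
                [0.1, 0.7, 0.6]])
for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(sim, k):.3f}")
```

Leaderboard values are this quantity over the full DiDeMo test split, reported as a percentage.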
Evaluation Results

Performance of each model on this benchmark:
| Model | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 | Paper Title |
|---|---|---|---|---|
| Singularity-5M | 36.9 | 61.1 | 69.3 | Revealing Single Frame Bias for Video-and-Language Learning |
| InternVideo2-6B | 57.9 | 80.0 | 84.6 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| BT-Adapter | 35.6 | 61.9 | 72.6 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning |
| LanguageBind(ViT-H/14) | 39.9 | 66.1 | 74.6 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
| HiTeA-17M | 43.2 | 69.3 | 79.0 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| Clover | 29.5 | 55.2 | 66.3 | Clover: Towards A Unified Video-Language Alignment and Fusion Model |
| LanguageBind(ViT-L/14) | 39.7 | 65.5 | 73.8 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
| mPLUG-2 | 45.7 | 71.1 | 79.2 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| VAST | 55.5 | 74.3 | 79.6 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| Singularity-17M | 37.1 | 61.7 | 69.9 | Revealing Single Frame Bias for Video-and-Language Learning |
| VIOLET | 23.5 | 49.8 | 59.8 | VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling |
| MILES | 27.2 | 50.3 | 63.6 | MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval |
| GRAM | 54.2 | - | 80.7 | Gramian Multimodal Representation Learning and Alignment |
| ALPRO | 23.8 | 47.3 | 57.9 | Align and Prompt: Video-and-Language Pre-training with Entity Prompts |
| InternVideo | 31.5 | 57.6 | 68.2 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| VideoCLIP | 16.6 | 46.9 | - | VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding |
| FROZEN | 21.1 | 46.0 | 56.2 | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval |
| Y. Ge et al. | 25.6 | 50.6 | 61.1 | Bridging Video-text Retrieval with Multiple Choice Questions |
| HiTeA-5M | 36.1 | 60.1 | 70.3 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| OA-Trans | 23.5 | 50.4 | 59.8 | Object-aware Video-language Pre-training for Retrieval |