HyperAI超神経
Zero-Shot Video Retrieval on MSVD
Evaluation metrics
text-to-video R@1
text-to-video R@5
text-to-video R@10
video-to-text R@1
video-to-text R@5
video-to-text R@10
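R@K (recall at K) is the percentage of queries whose correct match appears among the top K retrieved candidates. The following is a minimal sketch of that computation, not the evaluation code of any paper listed here. It assumes a precomputed query-by-candidate similarity matrix and exactly one correct candidate per query; the names `recall_at_k`, `sim`, and `gt` are hypothetical. MSVD pairs each video with several captions, so actual video-to-text protocols may differ between papers.

```python
import numpy as np

def recall_at_k(sim, gt, k):
    """Return Recall@K as a percentage.

    sim: (n_queries, n_candidates) similarity scores (assumed precomputed).
    gt:  gt[i] is the index of the single correct candidate for query i
         (an assumption; real benchmarks may allow several correct matches).
    """
    sim = np.asarray(sim, dtype=float)
    gt = np.asarray(gt)
    # Score that each query assigns to its correct candidate.
    correct = sim[np.arange(len(gt)), gt]
    # Zero-based rank: number of candidates scored strictly higher.
    ranks = (sim > correct[:, None]).sum(axis=1)
    # A query counts as a hit if its correct candidate is in the top k.
    return 100.0 * float((ranks < k).mean())

# Toy example: 3 text queries against 3 videos, query i matches video i.
sim = [[0.9, 0.1, 0.3],
       [0.2, 0.4, 0.8],
       [0.5, 0.6, 0.7]]
gt = [0, 1, 2]
r1 = recall_at_k(sim, gt, 1)
r2 = recall_at_k(sim, gt, 2)
```

Under the same one-to-one assumption, the video-to-text direction can be scored by passing the transposed matrix, so that videos act as queries and captions as candidates.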
Results
Performance of each model on this benchmark.
| Model | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 | video-to-text R@1 | video-to-text R@5 | video-to-text R@10 | Paper Title |
|---|---|---|---|---|---|---|---|
| InternVideo2-1B | 58.1 | 83.0 | 88.4 | 83.3 | 94.3 | 96.9 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| vid-TLDR (UMT-L) | 50.0 | 77.6 | 85.5 | 75.7 | 90.0 | 95.1 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer |
| CLIP4Clip | 38.5 | 66.9 | 76.8 | - | - | - | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval |
| SSML | 13.66 | 35.7 | 47.74 | - | - | - | Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning |
| MILES | 44.4 | 76.2 | 87.0 | - | - | - | MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval |
| HowToCaption | 44.5 | 73.3 | 82.1 | - | - | - | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale |
| Y. Ge et al. | 43.6 | 74.9 | 84.9 | - | - | - | Bridging Video-text Retrieval with Multiple Choice Questions |
| LanguageBind (ViT-H/14) | 53.9 | 80.4 | 87.8 | 72.0 | 91.4 | 96.3 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
| VAST, HowToCaption-finetuned | 54.8 | 80.9 | 87.2 | - | - | - | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale |
| InternVideo2-6B | 59.3 | 84.4 | 89.6 | 83.1 | 94.2 | 97.0 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| UMT-L (ViT-L/16) | 49.0 | 76.9 | 84.7 | 74.5 | 89.7 | 92.8 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| LaT | 36.9 | 68.6 | 81.0 | 34.4 | 69.0 | 79.2 | LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval |
| LanguageBind (ViT-L/14) | 54.1 | 81.1 | 88.1 | 69.7 | 91.8 | 97.9 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
| InternVideo | 43.4 | - | - | 67.6 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |