
Zero-Shot Video Retrieval on MSVD

Metrics

text-to-video R@1
text-to-video R@5
text-to-video R@10
video-to-text R@1
video-to-text R@5
video-to-text R@10
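
Recall@K (R@K) is the percentage of queries for which the correct item appears among the top K retrieved results. As a rough illustration of how the metrics below are computed, here is a minimal NumPy sketch that derives R@1/R@5/R@10 from a query-by-item similarity matrix; it assumes a one-to-one pairing between each query and its ground-truth item (MSVD has multiple captions per video, so real evaluation code maps each caption to its source video), and the function name and random scores are illustrative only.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Recall@K for retrieval, reported as a percentage.

    sim: (num_queries, num_items) similarity matrix; sim[i, j] scores
    query i against item j. The ground-truth item for query i is
    assumed to be item i (one-to-one pairing).
    """
    ranks = np.argsort(-sim, axis=1)          # items sorted by score, best first
    gt = np.arange(sim.shape[0])[:, None]     # ground-truth index per query
    hits = (ranks[:, :k] == gt).any(axis=1)   # is the ground truth within the top k?
    return 100.0 * hits.mean()

# Toy example with random scores; real numbers come from a model's
# text and video embeddings (e.g. cosine similarities).
sim = np.random.randn(100, 100)
print(recall_at_k(sim, 1), recall_at_k(sim, 5), recall_at_k(sim, 10))
```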

Results

Zero-shot retrieval performance of various models on this benchmark (Recall@K, %).

| Model Name | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 | video-to-text R@1 | video-to-text R@5 | video-to-text R@10 | Paper Title |
|---|---|---|---|---|---|---|---|
| InternVideo2-1B | 58.1 | 83.0 | 88.4 | 83.3 | 94.3 | 96.9 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| vid-TLDR (UMT-L) | 50.0 | 77.6 | 85.5 | 75.7 | 90.0 | 95.1 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer |
| CLIP4Clip | 38.5 | 66.9 | 76.8 | - | - | - | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval |
| SSML | 13.66 | 35.7 | 47.74 | - | - | - | Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning |
| MILES | 44.4 | 76.2 | 87.0 | - | - | - | MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval |
| HowToCaption | 44.5 | 73.3 | 82.1 | - | - | - | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale |
| Y. Ge et al. | 43.6 | 74.9 | 84.9 | - | - | - | Bridging Video-text Retrieval with Multiple Choice Questions |
| LanguageBind (ViT-H/14) | 53.9 | 80.4 | 87.8 | 72.0 | 91.4 | 96.3 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
| VAST, HowToCaption-finetuned | 54.8 | 80.9 | 87.2 | - | - | - | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale |
| InternVideo2-6B | 59.3 | 84.4 | 89.6 | 83.1 | 94.2 | 97.0 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| UMT-L (ViT-L/16) | 49.0 | 76.9 | 84.7 | 74.5 | 89.7 | 92.8 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| LaT | 36.9 | 68.6 | 81.0 | 34.4 | 69.0 | 79.2 | LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval |
| LanguageBind (ViT-L/14) | 54.1 | 81.1 | 88.1 | 69.7 | 91.8 | 97.9 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
| InternVideo | 43.4 | - | - | 67.6 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |