
Zero-Shot Video Retrieval on MSVD

Metrics

text-to-video R@1
text-to-video R@5
text-to-video R@10
video-to-text R@1
video-to-text R@5
video-to-text R@10
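
Recall@K (R@K) is the percentage of queries for which the correct item appears among the top K retrieved results. As a rough illustration of how the metrics below are computed, here is a minimal NumPy sketch that derives R@1/R@5/R@10 from a query-by-item similarity matrix; it assumes a one-to-one pairing between each query and its ground-truth item (MSVD has multiple captions per video, so real evaluation code maps each caption to its source video), and the function name and random scores are illustrative only.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Recall@K for retrieval, reported as a percentage.

    sim: (num_queries, num_items) similarity matrix; sim[i, j] scores
    query i against item j. The ground-truth item for query i is
    assumed to be item i (one-to-one pairing).
    """
    ranks = np.argsort(-sim, axis=1)          # items sorted by score, best first
    gt = np.arange(sim.shape[0])[:, None]     # ground-truth index per query
    hits = (ranks[:, :k] == gt).any(axis=1)   # is the ground truth within the top k?
    return 100.0 * hits.mean()

# Toy example with random scores; real numbers come from a model's
# text and video embeddings (e.g. cosine similarities).
sim = np.random.randn(100, 100)
print(recall_at_k(sim, 1), recall_at_k(sim, 5), recall_at_k(sim, 10))
```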

Results

Zero-shot retrieval performance of various models on this benchmark (Recall@K, %).

| Model Name | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 | video-to-text R@1 | video-to-text R@5 | video-to-text R@10 | Paper Title |
|---|---|---|---|---|---|---|---|
| InternVideo2-1B | 58.1 | 83.0 | 88.4 | 83.3 | 94.3 | 96.9 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| vid-TLDR (UMT-L) | 50.0 | 77.6 | 85.5 | 75.7 | 90.0 | 95.1 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer |
| CLIP4Clip | 38.5 | 66.9 | 76.8 | - | - | - | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval |
| SSML | 13.66 | 35.7 | 47.74 | - | - | - | Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning |
| MILES | 44.4 | 76.2 | 87.0 | - | - | - | MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval |
| HowToCaption | 44.5 | 73.3 | 82.1 | - | - | - | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale |
| Y. Ge et al. | 43.6 | 74.9 | 84.9 | - | - | - | Bridging Video-text Retrieval with Multiple Choice Questions |
| LanguageBind (ViT-H/14) | 53.9 | 80.4 | 87.8 | 72.0 | 91.4 | 96.3 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
| VAST, HowToCaption-finetuned | 54.8 | 80.9 | 87.2 | - | - | - | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale |
| InternVideo2-6B | 59.3 | 84.4 | 89.6 | 83.1 | 94.2 | 97.0 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| UMT-L (ViT-L/16) | 49.0 | 76.9 | 84.7 | 74.5 | 89.7 | 92.8 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| LaT | 36.9 | 68.6 | 81.0 | 34.4 | 69.0 | 79.2 | LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval |
| LanguageBind (ViT-L/14) | 54.1 | 81.1 | 88.1 | 69.7 | 91.8 | 97.9 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
| InternVideo | 43.4 | - | - | 67.6 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |