HyperAI

Zero-Shot Video Retrieval on MSR-VTT

Evaluation Metrics

text-to-video Median Rank
text-to-video R@1
text-to-video R@10
text-to-video R@5
video-to-text Median Rank
video-to-text R@1
video-to-text R@10
video-to-text R@5
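All of the metrics above are derived from a query-candidate similarity matrix: R@K is the percentage of queries whose ground-truth match appears in the top K retrieved candidates, and Median Rank is the median position of the ground-truth match. A minimal sketch of how they are computed, assuming (as in standard MSR-VTT evaluation) that ground-truth pairs lie on the diagonal of the matrix; the function name and interface here are illustrative, not from any particular codebase:

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute R@1/R@5/R@10 and Median Rank from a similarity matrix.

    sim[i, j] is the similarity between query i (e.g. a caption) and
    candidate j (e.g. a video). The ground-truth match for query i is
    assumed to be candidate i (diagonal pairing). For video-to-text
    metrics, pass the transposed matrix.
    """
    # 0-based rank of the ground-truth candidate for each query:
    # count how many candidates score strictly higher than the true one.
    gt_scores = np.diag(sim)
    ranks = (sim > gt_scores[:, None]).sum(axis=1)
    return {
        "R@1": float((ranks < 1).mean() * 100),
        "R@5": float((ranks < 5).mean() * 100),
        "R@10": float((ranks < 10).mean() * 100),
        "MdR": float(np.median(ranks) + 1),  # Median Rank, 1-based
    }
```

With a perfect model (highest score always on the diagonal), every R@K is 100 and the Median Rank is 1; the leaderboard below reports these numbers for both retrieval directions.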

Evaluation Results

Performance of each model on this benchmark

| Model | t2v MdR | t2v R@1 | t2v R@5 | t2v R@10 | v2t MdR | v2t R@1 | v2t R@5 | v2t R@10 | Paper |
|---|---|---|---|---|---|---|---|---|---|
| LanguageBind (ViT-L/14) | 2.0 | 42.8 | 67.5 | 76.0 | 3.0 | 38.3 | 65.8 | 77.8 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
| UMT-L (ViT-L/16) | - | 42.6 | 64.4 | 73.1 | - | 38.6 | 59.8 | 69.6 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| VideoCLIP | - | 10.4 | 22.2 | 30.0 | - | - | - | - | VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding |
| Norton | - | 10.7 | 24.1 | - | - | - | - | - | Multi-granularity Correspondence Learning from Long-term Noisy Videos |
| mPLUG-2 | - | 47.1 | 69.7 | 79.0 | - | - | - | - | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| OmniVL | - | 34.6 | 58.4 | 66.6 | - | - | - | - | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks |
| MIL-NCE | - | 9.9 | 24.0 | 32.4 | - | - | - | - | End-to-End Learning of Visual Representations from Uncurated Instructional Videos |
| HiTeA-5M | - | 29.9 | 54.2 | 62.9 | - | - | - | - | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| InternVideo2-1B | - | 51.9 | 75.3 | 82.5 | - | 50.9 | 73.4 | 81.8 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| ALPRO | 8 | 24.1 | 44.7 | 55.4 | - | - | - | - | Align and Prompt: Video-and-Language Pre-training with Entity Prompts |
| Singularity-17M | - | 34.0 | 56.7 | 66.7 | - | - | - | - | Revealing Single Frame Bias for Video-and-Language Learning |
| MMT | 66 | - | 14.4 | - | - | - | - | - | Multi-modal Transformer for Video Retrieval |
| Singularity-5M | - | 28.4 | 50.2 | 59.5 | - | - | - | - | Revealing Single Frame Bias for Video-and-Language Learning |
| HiTeA-17M | - | 34.4 | 60.0 | 69.9 | - | - | - | - | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| GRAM | - | 54.8 | - | 83.9 | - | 52.9 | - | 82.9 | Gramian Multimodal Representation Learning and Alignment |
| CLIP4Clip | 4 | 32.0 | 57.0 | 66.9 | - | - | - | - | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval |
| SSML | - | 8.0 | 21.3 | 29.3 | - | - | - | - | Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning |
| InternVideo | - | 40.7 | - | - | - | 39.6 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| Clover | 6 | 26.4 | 49.5 | 60.0 | - | - | - | - | Clover: Towards A Unified Video-Language Alignment and Fusion Model |
| HowToCaption | 3 | 37.6 | 62.0 | 73.3 | - | - | - | - | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale |

(t2v = text-to-video, v2t = video-to-text; MdR = Median Rank, lower is better; R@K values are percentages, higher is better; "-" means not reported.)