HyperAI초신경

Zero-Shot Video Retrieval on LSMDC

Evaluation Metrics

text-to-video R@1
text-to-video R@5
text-to-video R@10
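
Recall@k (R@k) is the fraction of text queries for which the ground-truth video appears among the top-k retrieved results. The sketch below shows one common way to compute these numbers from a text-to-video similarity matrix; the function name `recall_at_k` and the assumption that each text query has exactly one matching video located on the diagonal are illustrative conventions, not taken from any of the listed papers.

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Text-to-video Recall@k from a similarity matrix.

    sim: (num_texts, num_videos) array of similarity scores, where
         sim[i, i] is assumed to be the score of the ground-truth pair
         (one matching video per text query).
    """
    # Rank of the ground-truth video for each query: the number of
    # videos that score strictly higher than the true match.
    gt_scores = np.diag(sim)[:, None]        # (num_texts, 1)
    ranks = (sim > gt_scores).sum(axis=1)    # rank 0 = best possible
    return {f"R@{k}": 100.0 * float((ranks < k).mean()) for k in ks}

# Toy example with random similarities; real scores would come from a
# retrieval model such as those in the table below.
rng = np.random.default_rng(0)
sim = rng.normal(size=(1000, 1000))
print(recall_at_k(sim))  # chance level: roughly R@1=0.1, R@5=0.5, R@10=1.0
```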

Evaluation Results

Performance of each model on this benchmark:

| Model Name | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 | Paper Title |
| --- | --- | --- | --- | --- |
| BT-Adapter | 19.5 | 35.9 | 45.0 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning |
| HowToCaption | 17.3 | 31.7 | 38.6 | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale |
| HiTeA-17M | 18.3 | 36.7 | 44.2 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| Y. Ge et al. | 12.2 | 25.9 | 32.2 | Bridging Video-text Retrieval with Multiple Choice Questions |
| SSML | 4.2 | 11.6 | 17.1 | Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning |
| InternVideo2-6B | 33.8 | 55.9 | 62.2 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| CLIP4Clip | 15.1 | 28.5 | 36.4 | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval |
| UMT-L (ViT-L/16) | 25.2 | 43.0 | 50.5 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| HiTeA-5M | 15.5 | 31.1 | 39.8 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| mPLUG-2 | 24.1 | 43.8 | 52.0 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| MILES | 11.1 | 24.7 | 30.6 | MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval |
| InternVideo2-1B | 32.0 | 52.4 | 59.4 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| InternVideo | 17.6 | 32.4 | 40.2 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| Clover | 14.7 | 29.2 | 38.2 | Clover: Towards A Unified Video-Language Alignment and Fusion Model |
| Yatai Ji et al. | 17.2 | 32.4 | 39.1 | Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning |
| VAST, HowToCaption-finetuned | 27.7 | 46.5 | 54.6 | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale |