Zero-Shot Video Retrieval on DiDeMo

Metrics

text-to-video R@1
text-to-video R@10
text-to-video R@5
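
The page does not define these metrics. As a minimal sketch under the usual convention, text-to-video R@K is taken here to be the percentage of text queries whose ground-truth video appears among the top K retrieved videos. The function name `recall_at_k`, the assumption that query i matches video i, and the toy similarity matrix are all illustrative and are not taken from any of the listed papers.

```python
def recall_at_k(similarity, k):
    """Percentage of text queries whose matching video ranks in the top k.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    hits = 0
    for i, scores in enumerate(similarity):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for s in scores if s > scores[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 similarity matrix (made-up numbers, not benchmark data).
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked 1st
    [0.8, 0.5, 0.2],  # query 1: correct video ranked 2nd
    [0.7, 0.6, 0.1],  # query 2: correct video ranked 3rd
]

for k in (1, 2, 3):
    print(f"R@{k}: {recall_at_k(sim, k):.1f}")
```

Individual papers may differ in how they break ties or handle videos with several captions, so this sketch should not be read as the exact evaluation protocol behind the numbers on this page.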

Results

Performance results of various models on this benchmark

| Model Name | text-to-video R@1 | text-to-video R@10 | text-to-video R@5 | Paper Title |
| --- | --- | --- | --- | --- |
| Singularity-5M | 36.9 | 69.3 | 61.1 | Revealing Single Frame Bias for Video-and-Language Learning |
| InternVideo2-6B | 57.9 | 84.6 | 80.0 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| BT-Adapter | 35.6 | 72.6 | 61.9 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning |
| LanguageBind(ViT-H/14) | 39.9 | 74.6 | 66.1 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
| HiTeA-17M | 43.2 | 79.0 | 69.3 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| Clover | 29.5 | 66.3 | 55.2 | Clover: Towards A Unified Video-Language Alignment and Fusion Model |
| LanguageBind(ViT-L/14) | 39.7 | 73.8 | 65.5 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
| mPLUG-2 | 45.7 | 79.2 | 71.1 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| VAST | 55.5 | 79.6 | 74.3 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| Singularity-17M | 37.1 | 69.9 | 61.7 | Revealing Single Frame Bias for Video-and-Language Learning |
| VIOLET | 23.5 | 59.8 | 49.8 | VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling |
| MILES | 27.2 | 63.6 | 50.3 | MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval |
| GRAM | 54.2 | 80.7 | - | Gramian Multimodal Representation Learning and Alignment |
| ALPRO | 23.8 | 57.9 | 47.3 | Align and Prompt: Video-and-Language Pre-training with Entity Prompts |
| InternVideo | 31.5 | 68.2 | 57.6 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| VideoCLIP | 16.6 | - | 46.9 | VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding |
| FROZEN | 21.1 | 56.2 | 46.0 | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval |
| Y. Ge et al. | 25.6 | 61.1 | 50.6 | Bridging Video-text Retrieval with Multiple Choice Questions |
| HiTeA-5M | 36.1 | 70.3 | 60.1 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| OA-Trans | 23.5 | 59.8 | 50.4 | Object-aware Video-language Pre-training for Retrieval |