Zero-Shot Video Retrieval on DiDeMo
Metrics: text-to-video R@1, R@5, R@10

Results: performance of various models on this benchmark.

| Model Name | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 | Paper Title |
|---|---|---|---|---|
| InternVideo2-6B | 57.9 | 80.0 | 84.6 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| InternVideo2-1B | 57.0 | 80.0 | 85.1 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| VAST | 55.5 | 74.3 | 79.6 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| GRAM | 54.2 | - | 80.7 | Gramian Multimodal Representation Learning and Alignment |
| vid-TLDR (UMT-L) | 52.0 | 74.0 | 81.0 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer |
| UMT-L (ViT-L/16) | 48.6 | 72.9 | 79.0 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| mPLUG-2 | 45.7 | 71.1 | 79.2 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| HiTeA-17M | 43.2 | 69.3 | 79.0 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| LanguageBind (ViT-H/14) | 39.9 | 66.1 | 74.6 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
| LanguageBind (ViT-L/14) | 39.7 | 65.5 | 73.8 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
| Singularity-17M | 37.1 | 61.7 | 69.9 | Revealing Single Frame Bias for Video-and-Language Learning |
| Singularity-5M | 36.9 | 61.1 | 69.3 | Revealing Single Frame Bias for Video-and-Language Learning |
| HiTeA-5M | 36.1 | 60.1 | 70.3 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| BT-Adapter | 35.6 | 61.9 | 72.6 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning |
| OmniVL | 33.3 | 58.7 | 68.5 | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks |
| InternVideo | 31.5 | 57.6 | 68.2 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| Clover | 29.5 | 55.2 | 66.3 | Clover: Towards A Unified Video-Language Alignment and Fusion Model |
| MILES | 27.2 | 50.3 | 63.6 | - |
| Y. Ge et al. | 25.6 | 50.6 | 61.1 | Bridging Video-text Retrieval with Multiple Choice Questions |
| ALPRO | 23.8 | 47.3 | 57.9 | Align and Prompt: Video-and-Language Pre-training with Entity Prompts |
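The R@K numbers above measure text-to-video Recall@K: the percentage of text queries whose ground-truth video appears among the top-K ranked videos. A minimal sketch of how such a metric is typically computed is below; it assumes a precomputed text-video similarity matrix with ground-truth pairs on the diagonal (the embeddings and matrix here are synthetic, not from any model in the table).

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Text-to-video Recall@K (in percent) from a [num_texts, num_videos]
    similarity matrix, assuming text i's ground-truth video is index i."""
    num_texts = sim.shape[0]
    # Rank videos for each text query, best match first.
    ranking = np.argsort(-sim, axis=1)
    # Position of the ground-truth video in each query's ranking.
    gt_rank = np.argmax(ranking == np.arange(num_texts)[:, None], axis=1)
    return {k: float(np.mean(gt_rank < k)) * 100 for k in ks}

# Toy example: 4 queries, correct video clearly most similar.
rng = np.random.default_rng(0)
sim = np.eye(4) + 0.1 * rng.random((4, 4))
print(recall_at_k(sim))  # perfect retrieval here: 100.0 at every K
```

Real evaluations follow the same shape: encode all candidate videos and all text queries once, take the dot product (or cosine similarity) between the two embedding sets, then score the ranking with Recall@{1,5,10} as reported in the table.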