Video Retrieval on DiDeMo
Metrics: text-to-video R@1, R@5, and R@10. R@K (Recall at K) is the percentage of text queries for which the ground-truth video appears among the top K retrieved results.
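For reference, below is a minimal sketch of how these metrics are typically computed from a text-video similarity matrix. The function name `recall_at_k`, the dense NumPy matrix, and the convention that query i's ground-truth video sits at column i are illustrative assumptions, not the benchmark's official evaluation code.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, ks=(1, 5, 10)) -> dict:
    """Text-to-video Recall@K from a [num_texts, num_videos] similarity matrix.

    Assumes query i's ground-truth video is at column i, a common
    convention for paired retrieval benchmarks such as DiDeMo.
    """
    num_texts = sim.shape[0]
    # Score of the ground-truth video for each text query.
    gt_scores = sim[np.arange(num_texts), np.arange(num_texts)]
    # Rank of the ground-truth video: number of videos scoring strictly
    # higher than it (0 means it was retrieved first).
    ranks = (sim > gt_scores[:, None]).sum(axis=1)
    return {f"R@{k}": 100.0 * (ranks < k).mean() for k in ks}

# Toy usage: 4 text queries x 4 videos, higher score = more similar.
sim = np.array([
    [0.9, 0.1, 0.2, 0.0],
    [0.3, 0.8, 0.1, 0.2],
    [0.2, 0.7, 0.4, 0.1],  # query 2's true video is only ranked 2nd
    [0.1, 0.0, 0.2, 0.6],
])
print(recall_at_k(sim, ks=(1, 5)))  # {'R@1': 75.0, 'R@5': 100.0}
```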
Results

Performance of various models on this benchmark, reported as text-to-video recall (%); "-" marks values not reported. The top 20 of 40 leaderboard entries are shown.

| Model | R@1 | R@5 | R@10 | Paper |
|-------|-----|-----|------|-------|
| InternVideo2-6B | 74.2 | - | - | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| vid-TLDR (UMT-L) | 72.3 | 91.2 | 94.2 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer |
| VAST | 72.0 | 89.0 | 91.4 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| COSA | 70.5 | - | - | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model |
| UMT-L (ViT-L/16) | 70.4 | 90.1 | 93.5 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| GRAM | 67.3 | - | 90.1 | Gramian Multimodal Representation Learning and Alignment |
| VALOR | 61.5 | 85.3 | 90.4 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
| VindLU | 61.2 | 85.8 | 91.0 | VindLU: A Recipe for Effective Video-and-Language Pretraining |
| TESTA (ViT-B/16) | 61.2 | 87.2 | 91.5 | TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding |
| InternVideo | 57.9 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| RTQ | 57.6 | 84.1 | 89.9 | RTQ: Rethinking Video-language Understanding Based on Image-text Model |
| VLAB | 56.8 | 81.6 | 88.7 | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending |
| MuLTI | 56.5 | 80.2 | 87.0 | MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling |
| HiTeA | 56.5 | 81.7 | 89.7 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| mPLUG-2 | 56.4 | 79.1 | 85.2 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| CLIP-ViP | 55.3 | 82.0 | 89.3 | CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment |
| STAN | 54.6 | 78.4 | 85.1 | Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring |
| Singularity | 53.9 | 79.4 | 86.9 | Revealing Single Frame Bias for Video-and-Language Learning |
| HunYuan_tvr (huge) | 52.7 | 77.8 | 85.2 | Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations |
| DMAE (ViT-B/32) | 52.7 | 79.3 | 86.6 | Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning |