Video Retrieval on MSVD
Metrics
text-to-video R@1
video-to-text R@1
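
As a rough illustration of what these two metrics measure, the sketch below computes Recall@k in both directions from a text-by-video similarity matrix. The page does not state the evaluation protocol, so everything here is an assumption: the function name `recall_at_k`, the toy similarity values, the rule that a video query counts as a hit when any of its captions is ranked in the top k, and the reporting of scores as percentages. Individual papers may handle multiple captions per video differently.

```python
def recall_at_k(sim, text_to_video, k=1):
    """Return (text-to-video R@k, video-to-text R@k) as percentages.

    sim[i][j] is the similarity between text i and video j.
    text_to_video[i] is the index of the video that text i describes.
    Both names are hypothetical; they are not taken from any benchmark toolkit.
    """
    num_texts = len(sim)
    num_videos = len(sim[0])

    # Text-to-video: a text query is a hit if its own video is among
    # the k most similar videos.
    t2v_hits = 0
    for i in range(num_texts):
        ranked = sorted(range(num_videos), key=lambda j: sim[i][j], reverse=True)
        if text_to_video[i] in ranked[:k]:
            t2v_hits += 1

    # Video-to-text: a video query is a hit if any caption describing it
    # is among the k most similar texts (assumed multi-caption rule).
    v2t_hits = 0
    for j in range(num_videos):
        ranked = sorted(range(num_texts), key=lambda i: sim[i][j], reverse=True)
        if any(text_to_video[i] == j for i in ranked[:k]):
            v2t_hits += 1

    return 100.0 * t2v_hits / num_texts, 100.0 * v2t_hits / num_videos


# Toy data: 4 captions, 3 videos; captions 2 and 3 both describe video 2.
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.2, 0.7, 0.4],  # caption 2 ranks video 1 first, so it is a miss
    [0.1, 0.2, 0.6],
]
text_to_video = [0, 1, 2, 2]

t2v, v2t = recall_at_k(sim, text_to_video, k=1)
print(t2v, v2t)  # 75.0 100.0
```

In this toy case three of the four captions retrieve the right video first, while every video retrieves one of its own captions first, which is one reason the video-to-text scores can be higher than the text-to-video scores when videos have several captions.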
Results
Performance results of various models on this benchmark
| Model Name | text-to-video R@1 | video-to-text R@1 | Paper Title |
|---|---|---|---|
| InternVideo2-6B | 61.4 | 85.2 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| HunYuan_tvr (huge) | 59.0 | 73.0 | Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations |
| InternVideo | 58.4 | 76.3 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| HunYuan_tvr | 58.2 | 69.1 | Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations |
| vid-TLDR (UMT-L) | 57.9 | 82.7 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer |
| VLAB | 57.5 | - | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending |
| MDMMT-2 | 56.8 | - | MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization |
| Side4Video | 56.1 | - | Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning |
| CAMoE | 51.8 | 69.3 | Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss |
| Cap4Video | 51.8 | 70.0 | Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? |
| CenterCLIP (ViT-B/16) | 50.6 | 68.4 | CenterCLIP: Token Clustering for Efficient Text-Video Retrieval |
| X-CLIP | 50.4 | 66.8 | X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval |
| DMAE (ViT-B/32) | 48.7 | - | Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning |
| QB-Norm+CLIP2Video | 48.0 | - | Cross Modal Retrieval with Querybank Normalisation |
| DiffusionRet+QB-Norm | 47.9 | 60.3 | DiffusionRet: Generative Text-Video Retrieval with Diffusion Model |
| PAU | 47.3 | 68.9 | Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval |
| X-Pool | 47.2 | 66.4 | X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval |
| DiffusionRet | 46.6 | 61.9 | DiffusionRet: Generative Text-Video Retrieval with Diffusion Model |
| CLIP4Clip | 46.2 | 62.0 | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval |
| LAFF | 45.4 | - | Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval |