HyperAI
Video Retrieval on MSR-VTT
Metrics
text-to-video R@1
text-to-video R@5
text-to-video R@10
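Recall@K (R@K) for text-to-video retrieval is the percentage of text queries whose ground-truth video appears among the top-K videos ranked by similarity. A minimal sketch of the computation, assuming a square similarity matrix where text i's ground-truth video is at column i (the function name and toy values are illustrative, not from this leaderboard):

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Text-to-video Recall@K from a [num_texts, num_videos] similarity
    matrix, assuming text i's ground-truth video is at column i."""
    # Sort each row's video indices by descending similarity.
    order = np.argsort(-sim, axis=1)
    # Position of the ground-truth video in each ranking (0 = top-ranked).
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1)
    # Fraction of queries whose ground truth lands in the top K, as a percent.
    return float(np.mean(ranks < k)) * 100

# Toy 3-text x 3-video similarity matrix (illustrative values only).
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.2, 0.8],
                [0.1, 0.7, 0.6]])
print(recall_at_k(sim, 1))  # only text 0 ranks its video first -> 33.33...
```

By construction R@1 <= R@5 <= R@10, which is why the columns below increase left to right for every model.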
Results
Performance results of various models on this benchmark.

| Model | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 | Paper |
|---|---|---|---|---|
| GRAM | 64.0 | - | 89.3 | Gramian Multimodal Representation Learning and Alignment |
| VAST | 63.9 | 84.3 | 89.6 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| InternVideo2-6B | 62.8 | - | - | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| VALOR | 59.9 | 83.5 | 89.6 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
| UMT-L (ViT-L/16) | 58.8 | 81.0 | 87.1 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| vid-TLDR (UMT-L) | 58.1 | 81.0 | 81.6 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer |
| COSA | 57.9 | - | - | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model |
| InternVideo | 55.2 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| VLAB | 55.1 | 78.8 | 87.6 | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending |
| Aurora (ours, r=64) | 52.4 | 73.9 | 82.0 | - |
| TEFAL | 52.0 | 76.6 | 86.1 | Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment |
| UCoFiA | 49.4 | 72.1 | 83.5 | Unified Coarse-to-Fine Alignment for Video-Text Retrieval |
| OmniVL | 47.8 | 74.2 | 83.8 | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks |
| CLIP4Clip-seqTransf | 44.5 | 71.4 | 81.6 | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval |
| All-in-one + MELTR | 38.6 | 74.4 | 84.7 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models |
| VIOLETv2 | 37.2 | 64.8 | 75.8 | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling |
| HD-VILA | 35.6 | 65.3 | 78.0 | Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions |
| VideoCoCa (zero-shot) | 34.3 | 57.8 | 67.0 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners |
| MDMMT-2 | 33.7 | 60.5 | 70.8 | MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization |
| VIOLET + MELTR | 33.6 | 63.7 | 77.8 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models |