HyperAI
Video Retrieval On Youcook2
Metrics
text-to-video Median Rank
text-to-video R@1
text-to-video R@10
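For readers unfamiliar with these metrics: R@K (recall at K) is the percentage of text queries whose correct video appears among the top K retrieved results, so higher is better. Median Rank is the median position of the correct video across all queries, so lower is better. The page does not describe the evaluation protocol, so the sketch below is only an illustration under stated assumptions: each query has exactly one ground-truth video, query i matches video i, and ties are resolved in favor of the correct video. The function name `retrieval_metrics` and the toy similarity matrix are hypothetical.

```python
from statistics import median


def retrieval_metrics(similarity, ks=(1, 10)):
    """Compute R@K and median rank for text-to-video retrieval.

    similarity[i][j] is the score of text query i against video j.
    Assumption: the ground-truth video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank is 1 plus the number of videos scored strictly higher
        # than the correct one (ties favor the correct video).
        rank = 1 + sum(1 for score in row if score > target)
        ranks.append(rank)

    n = len(ranks)
    # R@K as a percentage of queries whose correct video ranks within the top K.
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["MedR"] = median(ranks)
    return metrics


# Toy 3x3 example (made-up scores, not benchmark data).
sim = [
    [0.9, 0.1, 0.3],  # correct video ranked 1st
    [0.8, 0.2, 0.5],  # correct video ranked 3rd
    [0.2, 0.4, 0.7],  # correct video ranked 1st
]
result = retrieval_metrics(sim, ks=(1, 2))
print(result)
```

In this toy case two of the three queries retrieve the correct video first, so R@1 is about 66.7 and the median rank is 1. Individual papers on the leaderboard may use different candidate pools or tie handling, so reported numbers are not guaranteed to follow this exact procedure.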
Results
Performance results of various models on this benchmark
| Model Name | text-to-video Median Rank | text-to-video R@1 | text-to-video R@10 | Paper Title |
| --- | --- | --- | --- | --- |
| VAST | - | 50.4 | 80.8 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| VideoCLIP | - | 32.2 | 75.0 | VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding |
| UniVL + MELTR | 3 | 33.7 | 74.8 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models |
| MDMMT-2 | 3.0 | 32.0 | 74.8 | MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization |
| TACo | 4 | 29.6 | 72.7 | TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment |
| OmniVec | - | - | 70.8 | OmniVec: Learning robust representations with cross modal sharing |
| UniVL | 4 | 28.9 | 70.0 | UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation |
| VLM | 4 | 27.05 | 69.38 | VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding |
| OmniVec (pretrained) | - | - | 64.2 | OmniVec: Learning robust representations with cross modal sharing |
| VideoCLIP (zero-shot) | - | 22.7 | 63.1 | VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding |
| VideoCoCa (zero-shot) | - | 21.7 | 55.2 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners |
| COOT | 9 | 16.7 | 52.3 | COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning |
| Text-Video Embedding | 24 | 8.2 | 35.3 | HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips |
| RoME | 53 | 6.3 | 25.2 | RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval |
| HGLMM FV CCA | 75 | 4.6 | 21.6 | Associating Neural Word Embeddings With Deep Image Representations Using Fisher Vectors |
| Satar et al. | 77 | 5.3 | 20.8 | Semantic Role Aware Correlation Transformer for Text to Video Retrieval |