Video Retrieval on SSv2 Label Retrieval
Metrics
text-to-video R@1
text-to-video R@10
text-to-video R@5
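
For readers unfamiliar with the R@K metrics listed above, here is a minimal sketch of how text-to-video Recall@K is commonly computed from a text-to-video similarity matrix. The function name `recall_at_k`, the toy matrix, and the assumption that query `i` is paired with video `i` are illustrative only and are not taken from any of the listed papers.

```python
def recall_at_k(similarity, k):
    """Percentage of text queries whose matching video ranks in the top k.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the ground-truth match for query i is video i.
    """
    hits = 0
    for i, scores in enumerate(similarity):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for s in scores if s > scores[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 similarity matrix (made-up numbers, for illustration only).
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct video is ranked 1st
    [0.8, 0.5, 0.6],  # query 1: correct video is ranked 3rd
    [0.2, 0.7, 0.4],  # query 2: correct video is ranked 2nd
]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(sim, k):.1f}")
```

Because the top-K set grows with K, R@K can never decrease as K increases, so for any single model R@1 ≤ R@5 ≤ R@10.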
Results
Performance results of various models on this benchmark
Model name | text-to-video R@1 | text-to-video R@10 | text-to-video R@5 | Paper Title | Repository |
---|---|---|---|---|---|
Singularity-temporal | 47.4 | 84.0 | 75.9 | Revealing Single Frame Bias for Video-and-Language Learning | |
vid-TLDR (UMT-L) | 73.1 | 96.6 | 93.3 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer | |
UMT-L (ViT-L/16) | 73.3 | 96.6 | 92.7 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | |
HiTeA | 55.2 | 89.1 | 81.4 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - |
VindLU | 53.1 | - | 81.8 | VindLU: A Recipe for Effective Video-and-Language Pretraining | |