HyperAI超神经

Video Retrieval on YouCook2

Evaluation Metrics

text-to-video Median Rank
text-to-video R@1
text-to-video R@10
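
As a rough illustration of how these metrics are commonly computed, the sketch below derives R@K and Median Rank from a text-to-video similarity matrix. It is a minimal example under stated assumptions, not the evaluation code of any listed paper: the function names are hypothetical, query i is assumed to match video i, ties are resolved in favor of the correct video, and R@K is reported as a percentage.

```python
from statistics import median


def retrieval_ranks(similarity):
    """Return the 1-based rank of the correct video for each text query.

    Assumption: similarity[i][j] scores text query i against video j,
    and video i is the ground-truth match for query i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank is 1 plus the number of videos scored strictly higher
        # than the correct one (ties favor the correct video).
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video appears in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median rank of the correct video across all queries (lower is better)."""
    return median(ranks)


# Toy 4x4 similarity matrix (made-up values, for illustration only).
similarity = [
    [0.9, 0.1, 0.3, 0.2],
    [0.8, 0.7, 0.1, 0.2],
    [0.2, 0.9, 0.3, 0.8],
    [0.1, 0.2, 0.3, 0.4],
]

ranks = retrieval_ranks(similarity)
print("ranks:", ranks)
print("R@1:", recall_at_k(ranks, 1))
print("R@10:", recall_at_k(ranks, 10))
print("Median Rank:", median_rank(ranks))
```

Higher R@K is better and lower Median Rank is better, so the two kinds of column should be read in opposite directions.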

Results

Performance of each model on this benchmark

| Model | text-to-video Median Rank | text-to-video R@1 | text-to-video R@10 | Paper Title | Repository |
|---|---|---|---|---|---|
| COOT | 9 | 16.7 | 52.3 | COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning | |
| Text-Video Embedding | 24 | 8.2 | 35.3 | HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | |
| HGLMM FV CCA | 75 | 4.6 | 21.6 | Associating Neural Word Embeddings With Deep Image Representations Using Fisher Vectors | - |
| Satar et al. | 77 | 5.3 | 20.8 | Semantic Role Aware Correlation Transformer for Text to Video Retrieval | |
| RoME | 53 | 6.3 | 25.2 | RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval | |
| VideoCLIP | - | 32.2 | 75.0 | VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | |
| VLM | 4 | 27.05 | 69.38 | VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding | |
| TACo | 4 | 29.6 | 72.7 | TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment | - |
| VAST | - | 50.4 | 80.8 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | |
| OmniVec (pretrained) | - | - | 64.2 | OmniVec: Learning robust representations with cross modal sharing | - |
| UniVL + MELTR | 3 | 33.7 | 74.8 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | |
| OmniVec | - | - | 70.8 | OmniVec: Learning robust representations with cross modal sharing | - |
| UniVL | 4 | 28.9 | 70.0 | UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | |
| VideoCLIP (zero-shot) | - | 22.7 | 63.1 | VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | |
| VideoCoCa (zero-shot) | - | 21.7 | 55.2 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | - |
| MDMMT-2 | 3.0 | 32.0 | 74.8 | MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization | - |