HyperAI超神经

Video Retrieval on LSMDC

Evaluation Metrics

text-to-video Mean Rank
text-to-video R@1
text-to-video R@10
text-to-video R@5
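
The metrics listed above are commonly computed from a text-to-video similarity matrix. The following is a minimal sketch, not the benchmark's official evaluation code. It assumes that `sim[i][j]` holds the similarity between text query `i` and video `j`, and that the correct video for query `i` is video `i`. The function name `retrieval_metrics` and the toy scores are hypothetical and used only for illustration.

```python
from typing import Dict, List, Sequence


def retrieval_metrics(
    sim: Sequence[Sequence[float]], ks: Sequence[int] = (1, 5, 10)
) -> Dict[str, float]:
    """Compute text-to-video R@K (in percent) and Mean Rank.

    Assumption: sim[i][j] is the similarity of text query i to video j,
    and the ground-truth video for query i is video i.
    """
    ranks: List[int] = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank = 1 + number of videos scored strictly higher than the correct one.
        rank = 1 + sum(1 for score in row if score > target)
        ranks.append(rank)

    n = len(ranks)
    # R@K: percentage of queries whose correct video is ranked within the top K.
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    # Mean Rank: average rank of the correct video (lower is better).
    metrics["Mean Rank"] = sum(ranks) / n
    return metrics


# Hypothetical toy scores: 3 text queries against 3 videos.
toy_sim = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked 1st
    [0.8, 0.5, 0.2],  # query 1: correct video ranked 2nd
    [0.7, 0.6, 0.4],  # query 2: correct video ranked 3rd
]
print(retrieval_metrics(toy_sim, ks=(1, 2)))
```

Under these assumptions, higher R@K is better and lower Mean Rank is better. Ties are resolved optimistically here (only strictly higher scores push the correct video down), which is one of several possible conventions.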

Evaluation Results

Performance of each model on this benchmark

| Model | text-to-video Mean Rank | text-to-video R@1 | text-to-video R@10 | text-to-video R@5 | Paper Title | Repository |
|---|---|---|---|---|---|---|
| CAMoE | 54.4 | 25.9 | 53.7 | 46.1 | Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | |
| HD-VILA | - | 17.4 | 44.1 | 34.1 | Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | |
| CLIP4Clip | 58.0 | 21.6 | 49.8 | 41.8 | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | |
| Collaborative Experts | - | 11.2 | 34.8 | 26.9 | Use What You Have: Video Retrieval Using Representations From Collaborative Experts | |
| MDMMT | 58.0 | 18.8 | 47.9 | 38.5 | MDMMT: Multidomain Multimodal Transformer for Video Retrieval | |
| EMCL-Net | - | 23.9 | 50.9 | 42.4 | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | |
| VALOR | - | 34.2 | 64.1 | 56.0 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | |
| X-Pool | 53.2 | 25.2 | 53.5 | 43.7 | X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval | |
| EMCL-Net (Ours)++ LSMDC Rohrbach et al. (2015) | 8 | - | 53.7 | - | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | |
| HunYuan_tvr (huge) | 3.9 | 40.4 | 92.8 | 80.1 | Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | - |
| InternVideo | - | 34.0 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | |
| MDMMT-2 | 48.0 | 26.9 | 55.9 | 46.7 | MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization | - |
| CLIP | - | 11.3 | 29.2 | 22.7 | A Straightforward Framework For Video Retrieval Using CLIP | |
| X-CLIP | - | 26.1 | - | - | X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | |
| HiTeA | - | 28.7 | 59.0 | 50.3 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - |
| MoEE | - | 10.1 | 34.6 | 25.6 | Learning a Text-Video Embedding from Incomplete and Heterogeneous Data | |
| MMT-Pretrained | - | 13.5 | 40.1 | 29.9 | Multi-modal Transformer for Video Retrieval | |
| QB-Norm+CLIP4Clip | - | 22.4 | 49.5 | 40.1 | Cross Modal Retrieval with Querybank Normalisation | |
| CenterCLIP (ViT-B/16) | 47.3 | 24.2 | 55.9 | 46.2 | CenterCLIP: Token Clustering for Efficient Text-Video Retrieval | |
| VIOLETv2 | - | 24 | 54.1 | 43.5 | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | |