HyperAI超神经

Video Retrieval on MSR-VTT-1kA

Evaluation Metrics

text-to-video Median Rank
text-to-video R@1
text-to-video R@10
text-to-video R@5
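
As a rough illustration of how these metrics are usually computed, here is a minimal sketch in Python. It assumes a square similarity matrix in which the correct video for text query i sits at index i, and it breaks ties optimistically. The function name `retrieval_metrics` and the toy scores are hypothetical and are not taken from any listed paper or from this page.

```python
from statistics import median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute text-to-video R@K (in percent) and Median Rank.

    Assumption: sim[i][j] is the similarity between text query i and
    video j, and the ground-truth video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the correct video: 1 + number of videos scored strictly higher.
        ranks.append(1 + sum(1 for score in row if score > row[i]))

    n = len(ranks)
    # R@K: percentage of queries whose correct video is ranked within the top K.
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    # Median Rank: median position of the correct video (lower is better).
    metrics["Median Rank"] = median(ranks)
    return metrics


# Toy example with three queries and three videos (made-up scores).
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.2, 0.4, 0.6],
]
toy_metrics = retrieval_metrics(toy_sim)
```

Higher R@K is better and lower Median Rank is better. Published evaluation code may handle ties or multiple captions per video differently, so treat this only as a reading aid for the table.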

Evaluation Results

Performance of each model on this benchmark

| Model | text-to-video Median Rank | text-to-video R@1 | text-to-video R@10 | text-to-video R@5 | Paper Title | Repository |
|---|---|---|---|---|---|---|
| UniVL + MELTR | 4 | 31.1 | 68.3 | 55.7 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | |
| Side4Video | 1.0 | 52.3 | 84.2 | 75.5 | Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | |
| CLIP2TV | 1 | 52.9 | 86.5 | 78.5 | CLIP2TV: Align, Match and Distill for Video-Text Retrieval | - |
| OmniVec | - | - | 89.4 | - | OmniVec: Learning robust representations with cross modal sharing | - |
| HiTeA | - | 46.8 | 81.9 | 71.2 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - |
| LAFF | - | 45.8 | 82 | 71.5 | Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval | |
| TeachCLIP (ViT-B/16) | - | 48.0 | 83.5 | 75.9 | Holistic Features are almost Sufficient for Text-to-Video Retrieval | |
| COTS | 2 | 36.8 | 73.2 | 63.8 | COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval | - |
| Florence | - | 37.6 | 72.6 | 63.8 | Florence: A New Foundation Model for Computer Vision | |
| Singularity | - | 41.5 | 77 | 68.7 | Revealing Single Frame Bias for Video-and-Language Learning | |
| X-CLIP | 2.0 | 49.3 | 84.8 | 75.8 | X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | |
| CLIP4Clip | 2 | - | 81.6 | - | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | |
| TACo | 4 | 28.4 | 71.2 | 57.8 | TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment | - |
| MDMMT | 2 | 38.9 | 79.7 | 69.0 | MDMMT: Multidomain Multimodal Transformer for Video Retrieval | |
| RTQ | - | 53.4 | 84.4 | 76.1 | RTQ: Rethinking Video-language Understanding Based on Image-text Model | |
| UCoFiA | - | 49.4 | 83.5 | 72.1 | Unified Coarse-to-Fine Alignment for Video-Text Retrieval | - |
| All-in-one + MELTR | - | 41.3 | 82.5 | 73.5 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | |
| PIDRo | 1.0 | 55.9 | 87.6 | 79.8 | PIDRo: Parallel Isomeric Attention with Dynamic Routing for Text-Video Retrieval | - |
| STAN | 1 | 54.1 | 87.8 | 79.5 | Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | |
| FROZEN | 3 | 31.0 | 70.5 | 59.5 | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | |