Video Retrieval on DiDeMo

Evaluation Metrics

text-to-video R@1
text-to-video R@10
text-to-video R@5
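
Text-to-video R@K (Recall at K) is the percentage of text queries whose ground-truth video appears among the top K retrieved videos. The following is a minimal sketch of that computation, not the evaluation code of any listed model. It assumes a hypothetical score matrix in which `similarity[i][j]` is the score between query `i` and video `j`, and that the correct video for query `i` sits at index `i`. The function name `recall_at_k` and the toy numbers are illustrative only.

```python
def recall_at_k(similarity, k):
    """Text-to-video Recall@K as a percentage.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the ground-truth video for query i is at index i.
    """
    hits = 0
    for i, scores in enumerate(similarity):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for s in scores if s > scores[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 score matrix (made-up numbers, unrelated to the results on this page).
toy_scores = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked 1st
    [0.8, 0.5, 0.6],  # query 1: correct video ranked 3rd
    [0.2, 0.7, 0.4],  # query 2: correct video ranked 2nd
]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(toy_scores, k):.1f}")
```

Ties are resolved in favor of the correct video in this sketch; a real evaluation script may break ties differently.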

Evaluation Results

Performance of each model on this benchmark

| Model | text-to-video R@1 | text-to-video R@10 | text-to-video R@5 | Paper Title | Repository |
|---|---|---|---|---|---|
| RTQ | 57.6 | 89.9 | 84.1 | RTQ: Rethinking Video-language Understanding Based on Image-text Model | |
| FROZEN | 31.0 | 72.4 | 59.8 | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | |
| DiffusionRet+QB-Norm | 48.9 | 83.3 | 75.5 | DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | |
| HunYuan_tvr (huge) | 52.7 | 85.2 | 77.8 | Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | - |
| HD-VILA | 28.8 | 69.1 | 57.4 | Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | |
| QB-Norm+CLIP4Clip | 43.5 | 80.9 | 71.4 | Cross Modal Retrieval with Querybank Normalisation | |
| Cap4Video | 52.0 | 87.5 | 79.4 | Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? | |
| STAN | 54.6 | 85.1 | 78.4 | Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | |
| CLIP-ViP | 55.3 | 89.3 | 82 | CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | |
| ALPRO | 35.9 | 78.8 | 67.5 | Align and Prompt: Video-and-Language Pre-training with Entity Prompts | |
| DRL | 49.0 | 84.5 | 76.5 | Disentangled Representation Learning for Text-Video Retrieval | |
| VAST | 72.0 | 91.4 | 89.0 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | |
| DMAE (ViT-B/32) | 52.7 | 86.6 | 79.3 | Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning | - |
| mPLUG-2 | 56.4 | 85.2 | 79.1 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | |
| Clover | 50.1 | 85.6 | 76.7 | Clover: Towards A Unified Video-Language Alignment and Fusion Model | |
| VindLU | 61.2 | 91.0 | 85.8 | VindLU: A Recipe for Effective Video-and-Language Pretraining | |
| Collaborative Experts | 16.1 | 54.4 | 41.1 | Use What You Have: Video Retrieval Using Representations From Collaborative Experts | |
| MuLTI | 56.5 | 87.0 | 80.2 | MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | - |
| UMT-L (ViT-L/16) | 70.4 | 93.5 | 90.1 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | |
| HiTeA | 56.5 | 89.7 | 81.7 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - |