HyperAI
Video Retrieval on LSMDC
Metrics: text-to-video Mean Rank, text-to-video R@1, text-to-video R@5, text-to-video R@10

Results
Performance of the different models on this benchmark. All metrics are text-to-video retrieval; Mean Rank is lower-is-better, R@K is higher-is-better.

| Model | Mean Rank ↓ | R@1 ↑ | R@5 ↑ | R@10 ↑ | Paper |
|---|---|---|---|---|---|
| CAMoE | 54.4 | 25.9 | 46.1 | 53.7 | Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss |
| HD-VILA | - | 17.4 | 34.1 | 44.1 | Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions |
| CLIP4Clip | 58.0 | 21.6 | 41.8 | 49.8 | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval |
| Collaborative Experts | - | 11.2 | 26.9 | 34.8 | Use What You Have: Video Retrieval Using Representations From Collaborative Experts |
| MDMMT | 58.0 | 18.8 | 38.5 | 47.9 | MDMMT: Multidomain Multimodal Transformer for Video Retrieval |
| EMCL-Net | - | 23.9 | 42.4 | 50.9 | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations |
| VALOR | - | 34.2 | 56.0 | 64.1 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
| X-Pool | 53.2 | 25.2 | 43.7 | 53.5 | X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval |
| EMCL-Net (Ours)++ | 8 | - | - | 53.7 | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations |
| HunYuan_tvr (huge) | 3.9 | 40.4 | 80.1 | 92.8 | Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations |
| InternVideo | - | 34.0 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| MDMMT-2 | 48.0 | 26.9 | 46.7 | 55.9 | MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization |
| CLIP | - | 11.3 | 22.7 | 29.2 | A Straightforward Framework For Video Retrieval Using CLIP |
| X-CLIP | - | 26.1 | - | - | X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval |
| HiTeA | - | 28.7 | 50.3 | 59.0 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| MoEE | - | 10.1 | 25.6 | 34.6 | Learning a Text-Video Embedding from Incomplete and Heterogeneous Data |
| MMT-Pretrained | - | 13.5 | 29.9 | 40.1 | Multi-modal Transformer for Video Retrieval |
| QB-Norm+CLIP4Clip | - | 22.4 | 40.1 | 49.5 | Cross Modal Retrieval with Querybank Normalisation |
| CenterCLIP (ViT-B/16) | 47.3 | 24.2 | 46.2 | 55.9 | CenterCLIP: Token Clustering for Efficient Text-Video Retrieval |
| VIOLETv2 | - | 24.0 | 43.5 | 54.1 | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling |
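The metrics above follow the standard text-to-video retrieval protocol: each text query is scored against all candidate videos, R@K is the percentage of queries whose ground-truth video lands in the top K, and Mean Rank is the average (1-based) rank of the ground-truth video. A minimal sketch of that computation, assuming a square similarity matrix whose diagonal holds the matching text-video pairs (the function name and dict keys are illustrative, not from any specific benchmark toolkit):

```python
import numpy as np

def retrieval_metrics(sim):
    """Text-to-video retrieval metrics from a similarity matrix.

    sim[i, j] = similarity between text query i and video j; the matching
    video for query i is assumed to sit at index i (diagonal ground truth).
    Returns R@1/R@5/R@10 as percentages and the 1-based mean rank.
    """
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    # Ground-truth score for each query is on the diagonal.
    gt = sim[np.arange(n), np.arange(n)]
    # Rank = number of videos scored strictly higher than the match, plus one.
    ranks = (sim > gt[:, None]).sum(axis=1) + 1
    return {
        "R@1": 100.0 * float(np.mean(ranks <= 1)),
        "R@5": 100.0 * float(np.mean(ranks <= 5)),
        "R@10": 100.0 * float(np.mean(ranks <= 10)),
        "MeanR": float(ranks.mean()),
    }
```

For example, with two queries where the first match is outranked by one distractor, the ranks are [2, 1], giving R@1 = 50.0 and MeanR = 1.5. Published numbers may differ slightly in tie-breaking or in using median rank instead of mean rank.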
Showing 20 of 38 results.