Video Retrieval on MSR-VTT-1kA

Evaluation Metrics

text-to-video Median Rank

text-to-video R@1

text-to-video R@10

text-to-video R@5
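
For readers unfamiliar with these metrics, the following is a minimal sketch of how R@K and Median Rank are commonly computed for text-to-video retrieval. It is not this leaderboard's evaluation script. It assumes a square similarity matrix in which the correct video for text query i is video i, and it resolves ties in favor of the correct video. The function name `retrieval_metrics` and the toy scores are hypothetical. Higher R@K is better; lower Median Rank is better.

```python
from statistics import median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (in percent) and Median Rank for text-to-video retrieval.

    Assumption: sim[i][j] is the similarity between text query i and
    video j, and the ground-truth video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the correct video = 1 + number of videos scored strictly higher.
        ranks.append(1 + sum(1 for score in row if score > row[i]))

    n = len(ranks)
    # R@K: percentage of queries whose correct video is ranked within the top K.
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    # Median Rank: median position of the correct video over all queries.
    metrics["MedianRank"] = median(ranks)
    return metrics


# Toy 4x4 similarity matrix (made-up scores, for illustration only).
toy_sim = [
    [0.9, 0.1, 0.2, 0.3],
    [0.8, 0.5, 0.1, 0.2],
    [0.1, 0.2, 0.7, 0.3],
    [0.6, 0.5, 0.4, 0.3],
]
result = retrieval_metrics(toy_sim, ks=(1, 2))
print(result)
```

In the toy matrix the correct videos land at ranks 1, 2, 1 and 4, so half of the queries count toward R@1 and three of four toward R@2.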

Evaluation Results

Performance results of each model on this benchmark

Model	text-to-video Median Rank	text-to-video R@1	text-to-video R@10	text-to-video R@5	Paper Title
HunYuan_tvr (huge)	1.0	62.9	90.8	84.5	Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations
OmniVec	-	-	89.4	-	OmniVec: Learning robust representations with cross modal sharing
CLIP-ViP	1.0	57.7	88.2	80.5	CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
STAN	1	54.1	87.8	79.5	Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring
PIDRo	1.0	55.9	87.6	79.8	PIDRo: Parallel Isomeric Attention with Dynamic Routing for Text-Video Retrieval
DRL	1	53.3	87.6	80.3	Disentangled Representation Learning for Text-Video Retrieval
TS2-Net	-	54.0	87.4	79.3	TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval
DMAE (ViT-B/16)	1.0	55.5	87.1	79.4	Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning
EERCF	-	54.1	86.9	78.8	Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning
CLIP2TV	1	52.9	86.5	78.5	CLIP2TV: Align, Match and Distill for Video-Text Retrieval
MuLTI	-	54.7	86.0	77.7	MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling
EMCL-Net++	-	51.6	85.3	78.1	Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
CAMoE	2	48.8	85.3	75.6	Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss
X-CLIP	2.0	49.3	84.8	75.8	X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
mPLUG-2	-	53.1	84.7	77.6	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
RTQ	-	53.4	84.4	76.1	RTQ: Rethinking Video-language Understanding Based on Image-text Model
Side4Video	1.0	52.3	84.2	75.5	Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning
X2-VLM (large)	-	49.6	84.2	76.7	X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
X2-VLM (base)	-	47.6	84.2	74.1	X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
Cap4Video	1	51.4	83.9	75.7	Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
