Video Retrieval on MSVD

Evaluation Metrics

text-to-video R@1

video-to-text R@1
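
R@1 (recall at rank 1) is the percentage of queries whose correct match is the top-ranked result. The following is a minimal sketch of how such a score could be computed from a similarity matrix; it is not the evaluation code used by any of the listed papers. The function name `recall_at_k`, the toy matrix, and the assumption that query i matches candidate i are all illustrative. Real MSVD evaluation pairs each video with several captions, which this sketch does not handle.

```python
def recall_at_k(similarity, k=1):
    """Percentage of queries whose correct candidate ranks in the top k.

    Assumption (illustrative only): similarity[i][j] scores query i against
    candidate j, and the correct candidate for query i is at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank = number of candidates scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy similarity matrix: rows are captions, columns are videos.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]

# Text-to-video: each caption ranks the videos.
t2v_r1 = recall_at_k(sim, k=1)

# Video-to-text: transpose so each video ranks the captions.
v2t_r1 = recall_at_k([list(col) for col in zip(*sim)], k=1)
```

In this toy case two of the three captions retrieve their own video first, so the text-to-video R@1 is about 66.7.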

Evaluation Results

Performance of each model on this benchmark

Model	text-to-video R@1	video-to-text R@1	Paper Title
InternVideo2-6B	61.4	85.2	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
HunYuan_tvr (huge)	59.0	73.0	Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations
InternVideo	58.4	76.3	InternVideo: General Video Foundation Models via Generative and Discriminative Learning
HunYuan_tvr	58.2	69.1	Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations
vid-TLDR (UMT-L)	57.9	82.7	vid-TLDR: Training Free Token merging for Light-weight Video Transformer
VLAB	57.5	-	VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending
MDMMT-2	56.8	-	MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization
Side4Video	56.1	-	Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning
CAMoE	51.8	69.3	Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss
Cap4Video	51.8	70.0	Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
CenterCLIP (ViT-B/16)	50.6	68.4	CenterCLIP: Token Clustering for Efficient Text-Video Retrieval
X-CLIP	50.4	66.8	X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
DMAE (ViT-B/32)	48.7	-	Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning
QB-Norm+CLIP2Video	48.0	-	Cross Modal Retrieval with Querybank Normalisation
DiffusionRet+QB-Norm	47.9	60.3	DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
PAU	47.3	68.9	Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval
X-Pool	47.2	66.4	X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval
DiffusionRet	46.6	61.9	DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
CLIP4Clip	46.2	62.0	CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
LAFF	45.4	-	Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval
