Zero-Shot Video Retrieval on MSR-VTT

Evaluation Metrics

text-to-video Median Rank

text-to-video R@1

text-to-video R@10

text-to-video R@5

video-to-text Median Rank

video-to-text R@1

video-to-text R@10

video-to-text R@5
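
The metrics listed above can be illustrated with a minimal sketch. It assumes a square similarity matrix in which query i has exactly one correct match, candidate i. That is a simplification: actual MSR-VTT evaluation protocols may pair several captions with one video. The function names and the toy scores are hypothetical and are not taken from any of the listed papers.

```python
from statistics import median


def retrieval_ranks(sim):
    """Return the 1-based rank of the correct candidate for each query.

    Assumes sim[i][j] is the similarity of query i to candidate j and that
    candidate i is the single correct match for query i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank = 1 + number of candidates scored strictly higher than the match.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def recall_at_k(ranks, k):
    """R@K: percentage of queries whose correct match is ranked in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median Rank: median position of the correct match (lower is better)."""
    return median(ranks)


# Toy text-to-video similarity matrix (rows: captions, columns: videos).
sim_t2v = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
ranks_t2v = retrieval_ranks(sim_t2v)

# Video-to-text retrieval uses the transposed matrix (rows: videos).
sim_v2t = [list(col) for col in zip(*sim_t2v)]
ranks_v2t = retrieval_ranks(sim_v2t)

for name, ranks in (("text-to-video", ranks_t2v), ("video-to-text", ranks_v2t)):
    print(name, "R@1:", round(recall_at_k(ranks, 1), 1),
          "Median Rank:", median_rank(ranks))
```

Higher R@K is better, while a lower Median Rank is better. A Median Rank of 1 means the correct item is ranked first for at least half of the queries.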

Evaluation Results

Performance of each model on this benchmark

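Model	text-to-video Median Rank	text-to-video R@1	text-to-video R@10	text-to-video R@5	video-to-text Median Rank	video-to-text R@1	video-to-text R@10	video-to-text R@5	Paper Title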
InternVideo2-6B	-	55.9	85.1	78.3	-	53.7	84.1	77.5	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
GRAM	-	54.8	83.9	-	-	52.9	82.9	-	Gramian Multimodal Representation Learning and Alignment
InternVideo2-1B	-	51.9	82.5	75.3	-	50.9	81.8	73.4	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
VAST, HowToCaption-finetuned	1	50	81.4	73.2	-	-	-	-	HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
VAST	-	49.3	73.9	68.3	-	-	-	-	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
mPLUG-2	-	47.1	79.0	69.7	-	-	-	-	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
LanguageBind(ViT-H/14)	2	44.8	78.7	70.0	2	40.9	75.7	66.4	LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
LanguageBind(ViT-L/14)	2.0	42.8	76.0	67.5	3.0	38.3	77.8	65.8	LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
UMT-L (ViT-L/16)	-	42.6	73.1	64.4	-	38.6	69.6	59.8	Unmasked Teacher: Towards Training-Efficient Video Foundation Models
vid-TLDR (UMT-L)	-	42.1	72.4	63.9	-	37.7	69.4	59.8	vid-TLDR: Training Free Token merging for Light-weight Video Transformer
BT-Adapter	-	40.9	73.5	64.7	-	-	-	-	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
InternVideo	-	40.7	-	-	-	39.6	-	-	InternVideo: General Video Foundation Models via Generative and Discriminative Learning
HowToCaption	3	37.6	73.3	62	-	-	-	-	HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
Florence	-	37.6	72.6	63.8	-	-	-	-	Florence: A New Foundation Model for Computer Vision
ImageBind	-	36.8	70.0	61.8	-	-	-	-	ImageBind: One Embedding Space To Bind Them All
OmniVL	-	34.6	66.6	58.4	-	-	-	-	OmniVL: One Foundation Model for Image-Language and Video-Language Tasks
HiTeA-17M	-	34.4	69.9	60.0	-	-	-	-	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
Singularity-17M	-	34.0	66.7	56.7	-	-	-	-	Revealing Single Frame Bias for Video-and-Language Learning
CLIP4Clip	4	32.0	66.9	57.0	-	-	-	-	CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
Yatai Ji et al.	-	30.9	65.0	54.4	-	-	-	-	-
