Zero-Shot Video Retrieval on DiDeMo
Metrics: text-to-video R@1, R@5, R@10

Results: performance of various models on this benchmark.

| Model Name | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 | Paper Title |
|---|---|---|---|---|
| InternVideo2-6B | 57.9 | 80.0 | 84.6 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| InternVideo2-1B | 57.0 | 80.0 | 85.1 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| VAST | 55.5 | 74.3 | 79.6 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| GRAM | 54.2 | - | 80.7 | Gramian Multimodal Representation Learning and Alignment |
| vid-TLDR (UMT-L) | 52.0 | 74.0 | 81.0 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer |
| UMT-L (ViT-L/16) | 48.6 | 72.9 | 79.0 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| mPLUG-2 | 45.7 | 71.1 | 79.2 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| HiTeA-17M | 43.2 | 69.3 | 79.0 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| LanguageBind (ViT-H/14) | 39.9 | 66.1 | 74.6 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
| LanguageBind (ViT-L/14) | 39.7 | 65.5 | 73.8 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
| Singularity-17M | 37.1 | 61.7 | 69.9 | Revealing Single Frame Bias for Video-and-Language Learning |
| Singularity-5M | 36.9 | 61.1 | 69.3 | Revealing Single Frame Bias for Video-and-Language Learning |
| HiTeA-5M | 36.1 | 60.1 | 70.3 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| BT-Adapter | 35.6 | 61.9 | 72.6 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning |
| OmniVL | 33.3 | 58.7 | 68.5 | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks |
| InternVideo | 31.5 | 57.6 | 68.2 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| Clover | 29.5 | 55.2 | 66.3 | Clover: Towards A Unified Video-Language Alignment and Fusion Model |
| MILES | 27.2 | 50.3 | 63.6 | - |
| Y. Ge et al. | 25.6 | 50.6 | 61.1 | Bridging Video-text Retrieval with Multiple Choice Questions |
| ALPRO | 23.8 | 47.3 | 57.9 | Align and Prompt: Video-and-Language Pre-training with Entity Prompts |
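The R@K numbers above measure text-to-video Recall@K: the percentage of text queries whose ground-truth video appears among the top-K ranked videos. A minimal sketch of how such a metric is typically computed is below; it assumes a precomputed text-video similarity matrix with ground-truth pairs on the diagonal (the embeddings and matrix here are synthetic, not from any model in the table).

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Text-to-video Recall@K (in percent) from a [num_texts, num_videos]
    similarity matrix, assuming text i's ground-truth video is index i."""
    num_texts = sim.shape[0]
    # Rank videos for each text query, best match first.
    ranking = np.argsort(-sim, axis=1)
    # Position of the ground-truth video in each query's ranking.
    gt_rank = np.argmax(ranking == np.arange(num_texts)[:, None], axis=1)
    return {k: float(np.mean(gt_rank < k)) * 100 for k in ks}

# Toy example: 4 queries, correct video clearly most similar.
rng = np.random.default_rng(0)
sim = np.eye(4) + 0.1 * rng.random((4, 4))
print(recall_at_k(sim))  # perfect retrieval here: 100.0 at every K
```

Real evaluations follow the same shape: encode all candidate videos and all text queries once, take the dot product (or cosine similarity) between the two embedding sets, then score the ranking with Recall@{1,5,10} as reported in the table.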