HyperAI초신경
Video Retrieval on DiDeMo
Evaluation Metrics
text-to-video R@1
text-to-video R@5
text-to-video R@10
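text-to-video R@K measures the fraction of text queries for which the ground-truth video appears among the top-K retrieved results, reported as a percentage. A minimal sketch of the computation, assuming a hypothetical similarity matrix where query i's ground-truth video is at index i:

```python
def recall_at_k(sim, k):
    """text-to-video R@K: fraction of text queries whose ground-truth
    video (assumed to sit at the same index) ranks in the top-K."""
    hits = 0
    for i, row in enumerate(sim):
        # Rank video indices by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: -row[j])
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(sim)  # reported as a percentage

# Toy 3-query x 3-video similarity matrix (hypothetical scores).
sim = [
    [0.9, 0.1, 0.2],  # ground truth for query 0 is video 0 (ranked 1st)
    [0.3, 0.2, 0.8],  # ground truth for query 1 is video 1 (ranked 3rd)
    [0.1, 0.7, 0.4],  # ground truth for query 2 is video 2 (ranked 2nd)
]
print(recall_at_k(sim, 1))  # 1 of 3 queries hit at K=1
```

Raising K can only increase the score, which is why R@10 is always at least as high as R@5 and R@1 in the table below.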
Evaluation Results

Performance of each model on this benchmark:

| Model Name | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 | Paper Title |
|---|---|---|---|---|
| RTQ | 57.6 | 84.1 | 89.9 | RTQ: Rethinking Video-language Understanding Based on Image-text Model |
| FROZEN | 31.0 | 59.8 | 72.4 | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval |
| DiffusionRet+QB-Norm | 48.9 | 75.5 | 83.3 | DiffusionRet: Generative Text-Video Retrieval with Diffusion Model |
| HunYuan_tvr (huge) | 52.7 | 77.8 | 85.2 | Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations |
| HD-VILA | 28.8 | 57.4 | 69.1 | Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions |
| QB-Norm+CLIP4Clip | 43.5 | 71.4 | 80.9 | Cross Modal Retrieval with Querybank Normalisation |
| Cap4Video | 52.0 | 79.4 | 87.5 | Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? |
| STAN | 54.6 | 78.4 | 85.1 | Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring |
| CLIP-ViP | 55.3 | 82.0 | 89.3 | CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment |
| ALPRO | 35.9 | 67.5 | 78.8 | Align and Prompt: Video-and-Language Pre-training with Entity Prompts |
| DRL | 49.0 | 76.5 | 84.5 | Disentangled Representation Learning for Text-Video Retrieval |
| VAST | 72.0 | 89.0 | 91.4 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| DMAE (ViT-B/32) | 52.7 | 79.3 | 86.6 | Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning |
| mPLUG-2 | 56.4 | 79.1 | 85.2 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| Clover | 50.1 | 76.7 | 85.6 | Clover: Towards A Unified Video-Language Alignment and Fusion Model |
| VindLU | 61.2 | 85.8 | 91.0 | VindLU: A Recipe for Effective Video-and-Language Pretraining |
| Collaborative Experts | 16.1 | 41.1 | 54.4 | Use What You Have: Video Retrieval Using Representations From Collaborative Experts |
| MuLTI | 56.5 | 80.2 | 87.0 | MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling |
| UMT-L (ViT-L/16) | 70.4 | 90.1 | 93.5 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| HiTeA | 56.5 | 81.7 | 89.7 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |