Video Retrieval on ActivityNet

Evaluation Metrics

text-to-video Median Rank

text-to-video R@1

text-to-video R@5

text-to-video R@50
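
As a rough illustration of how these metrics are usually computed, the sketch below derives Recall@K and Median Rank from a text-to-video similarity matrix. It assumes that `sim[i][j]` scores text query `i` against video `j` and that video `i` is the single correct match for query `i`. The function names and the toy matrix are hypothetical, and ties are resolved optimistically, which individual papers may handle differently.

```python
from statistics import median


def retrieval_ranks(sim):
    """Return the 1-based rank of the correct video for each text query.

    Assumption: sim[i][j] is the similarity of query i to video j,
    and video i is the ground-truth match for query i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank = 1 + number of videos scored strictly higher than the correct one.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video appears in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median position of the correct video across all queries (lower is better)."""
    return median(ranks)


# Toy 4x4 similarity matrix with made-up values, for illustration only.
sim = [
    [0.9, 0.1, 0.2, 0.3],  # correct video ranked 1st
    [0.8, 0.5, 0.1, 0.2],  # correct video ranked 2nd
    [0.1, 0.2, 0.7, 0.3],  # correct video ranked 1st
    [0.6, 0.7, 0.8, 0.4],  # correct video ranked 4th
]
ranks = retrieval_ranks(sim)
print(ranks, recall_at_k(ranks, 1), recall_at_k(ranks, 5), median_rank(ranks))
```

Higher R@K is better, while a Median Rank of 1 means the correct video is the top result for at least half of the queries.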

Evaluation Results

Performance of each model on this benchmark

Model	text-to-video Median Rank	text-to-video R@1	text-to-video R@5	text-to-video R@50	Paper Title
InternVideo2-6B	-	74.1	-	-	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
VAST	-	70.5	90.9	-	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
VALOR	-	70.1	90.8	-	VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
GRAM	-	69.9	-	-	Gramian Multimodal Representation Learning and Alignment
COSA	-	67.3	-	-	COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
UMT-L (ViT-L/16)	-	66.8	89.1	-	Unmasked Teacher: Towards Training-Efficient Video Foundation Models
vid-TLDR (UMT-L)	-	66.7	88.6	-	vid-TLDR: Training Free Token merging for Light-weight Video Transformer
InternVideo	-	62.2	-	-	InternVideo: General Video Foundation Models via Generative and Discriminative Learning
CLIP-ViP	1	61.4	85.7	-	CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
HunYuan_tvr	1	57.3	84.8	-	Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations
VindLU	-	55.0	81.4	-	VindLU: A Recipe for Effective Video-and-Language Pretraining
TESTA (ViT-B/16)	-	54.8	80.8	-	TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
RTQ	-	53.5	81.4	-	RTQ: Rethinking Video-language Understanding Based on Image-text Model
DMAE (ViT-B/32)	1.0	53.4	80.7	-	Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning
CAMoE	1	51.0	77.7	-	Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss
EMCL-Net++	-	50.6	78.7	98.1	Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
HiTeA	-	49.7	77.1	-	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
DiffusionRet+QB-Norm	2.0	48.1	-	-	DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
Singularity	-	47.1	75.5	-	Revealing Single Frame Bias for Video-and-Language Learning
X-CLIP	-	46.2	75.5	-	X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
