HyperAI초신경

Video Retrieval on VATEX

Metrics

text-to-video R@1

text-to-video R@10

text-to-video R@5
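
The R@K (Recall at K) metrics above report the percentage of text queries for which the correct video appears among the top K retrieved results. A minimal sketch of that calculation follows. It assumes a square score matrix in which the correct video for query i sits at index i. The function name `recall_at_k` and the toy scores are hypothetical and are not taken from any of the listed papers.

```python
def recall_at_k(similarity, k):
    """Percentage of text queries whose ground-truth video ranks in the top k.

    Assumes similarity[i][j] scores text query i against video j, and that
    the correct video for query i is at index i (an illustrative layout).
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank: how many videos score strictly higher than the
        # correct one. Ties are resolved in favour of the correct video.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 score matrix (made-up numbers, not VATEX data).
toy_scores = [
    [0.9, 0.1, 0.3],  # correct video ranked 1st
    [0.8, 0.5, 0.2],  # correct video ranked 2nd
    [0.4, 0.6, 0.1],  # correct video ranked 3rd
]

for k in (1, 2, 3):
    print(f"R@{k}: {recall_at_k(toy_scores, k):.1f}")
```

Individual papers may differ in how they break ties or handle multiple captions per video, so published numbers are not guaranteed to follow this exact procedure.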

Results

Performance of each model on this benchmark

Model	text-to-video R@1	text-to-video R@10	text-to-video R@5	Paper Title
GRAM	87.7	100	-	Gramian Multimodal Representation Learning and Alignment
VAST	83.0	99.2	98.2	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
VALOR	78.5	98.7	97.1	VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
InternVideo2-6B	75.5	-	-	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
Unmasked Teacher	72	97.8	95.1	Unmasked Teacher: Towards Training-Efficient Video Foundation Models
InternVideo	71.1	-	-	InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Side4Video	68.8	97.0	93.5	Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning
Cap4Video	66.6	97.0	93.1	Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
TeachCLIP	63.6	96.1	91.9	Holistic Features are almost Sufficient for Text-to-Video Retrieval
TS2-Net	59.1	95.2	-	TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval
LAFF	59.1	91.7	-	Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval
QB-Norm+CLIP2Video	58.8	93.8	-	Cross Modal Retrieval with Querybank Normalisation
CLIP2Video	57.3	90	-	CLIP2Video: Mastering Video-Text Retrieval via Image CLIP
