HyperAI초신경

Video Retrieval on YouCook2

Evaluation Metrics

text-to-video Median Rank

text-to-video R@1

text-to-video R@10
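
The three metrics above can be illustrated with a minimal sketch. It assumes a square similarity matrix in which row i holds the scores of text query i against every video, and that the correct video for query i is video i. The function name `retrieval_metrics` and the tie-handling rule (ties resolved in favor of the correct video) are assumptions for illustration only; the papers listed on this page may use different evaluation code.

```python
from statistics import median


def retrieval_metrics(sim, ks=(1, 10)):
    """Compute text-to-video Median Rank and R@K (in percent).

    sim[i][j] is the similarity of text query i to video j.
    Assumption: the ground-truth video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank = 1 + number of videos scored strictly higher than the correct one.
        ranks.append(1 + sum(1 for score in row if score > target))

    metrics = {"median_rank": median(ranks)}
    for k in ks:
        # R@K: share of queries whose correct video appears in the top K.
        metrics[f"R@{k}"] = 100.0 * sum(r <= k for r in ranks) / len(ranks)
    return metrics


# Toy example with 3 queries and 3 videos; the per-query ranks are [1, 2, 2].
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.5, 0.3],
    [0.1, 0.7, 0.6],
]
metrics = retrieval_metrics(sim, ks=(1, 2))
print(metrics)
```

A lower Median Rank is better, while a higher R@K is better.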

Evaluation Results

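Performance of each model on this benchmark, sorted by text-to-video R@10 in descending order. R@1 and R@10 are percentages; a dash (-) marks a value that is not reported.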

Model	text-to-video Median Rank	text-to-video R@1	text-to-video R@10	Paper Title
VAST	-	50.4	80.8	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
VideoCLIP	-	32.2	75.0	VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
UniVL + MELTR	3	33.7	74.8	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models
MDMMT-2	3.0	32.0	74.8	MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization
TACo	4	29.6	72.7	TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment
OmniVec	-	-	70.8	OmniVec: Learning robust representations with cross modal sharing
UniVL	4	28.9	70.0	UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
VLM	4	27.05	69.38	VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
OmniVec (pretrained)	-	-	64.2	OmniVec: Learning robust representations with cross modal sharing
VideoCLIP (zero-shot)	-	22.7	63.1	VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
VideoCoCa (zero-shot)	-	21.7	55.2	VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
COOT	9	16.7	52.3	COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Text-Video Embedding	24	8.2	35.3	HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
RoME	53	6.3	25.2	RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval
HGLMM FV CCA	75	4.6	21.6	Associating Neural Word Embeddings With Deep Image Representations Using Fisher Vectors
Satar et al.	77	5.3	20.8	Semantic Role Aware Correlation Transformer for Text to Video Retrieval
