HyperAI
Video Retrieval on MSR-VTT
Metrics
text-to-video R@1
text-to-video R@5
text-to-video R@10
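Recall@K (R@K) for text-to-video retrieval is the percentage of text queries whose ground-truth video appears among the top-K videos ranked by similarity. A minimal sketch of the computation, assuming a square similarity matrix where text i's ground-truth video is at column i (the function name and toy values are illustrative, not from this leaderboard):

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Text-to-video Recall@K from a [num_texts, num_videos] similarity
    matrix, assuming text i's ground-truth video is at column i."""
    # Sort each row's video indices by descending similarity.
    order = np.argsort(-sim, axis=1)
    # Position of the ground-truth video in each ranking (0 = top-ranked).
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1)
    # Fraction of queries whose ground truth lands in the top K, as a percent.
    return float(np.mean(ranks < k)) * 100

# Toy 3-text x 3-video similarity matrix (illustrative values only).
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.2, 0.8],
                [0.1, 0.7, 0.6]])
print(recall_at_k(sim, 1))  # only text 0 ranks its video first -> 33.33...
```

By construction R@1 <= R@5 <= R@10, which is why the columns below increase left to right for every model.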
Results
Performance results of various models on this benchmark.

| Model | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 | Paper |
|---|---|---|---|---|
| GRAM | 64.0 | - | 89.3 | Gramian Multimodal Representation Learning and Alignment |
| VAST | 63.9 | 84.3 | 89.6 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| InternVideo2-6B | 62.8 | - | - | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| VALOR | 59.9 | 83.5 | 89.6 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
| UMT-L (ViT-L/16) | 58.8 | 81.0 | 87.1 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| vid-TLDR (UMT-L) | 58.1 | 81.0 | 81.6 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer |
| COSA | 57.9 | - | - | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model |
| InternVideo | 55.2 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| VLAB | 55.1 | 78.8 | 87.6 | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending |
| Aurora (ours, r=64) | 52.4 | 73.9 | 82.0 | - |
| TEFAL | 52.0 | 76.6 | 86.1 | Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment |
| UCoFiA | 49.4 | 72.1 | 83.5 | Unified Coarse-to-Fine Alignment for Video-Text Retrieval |
| OmniVL | 47.8 | 74.2 | 83.8 | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks |
| CLIP4Clip-seqTransf | 44.5 | 71.4 | 81.6 | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval |
| All-in-one + MELTR | 38.6 | 74.4 | 84.7 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models |
| VIOLETv2 | 37.2 | 64.8 | 75.8 | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling |
| HD-VILA | 35.6 | 65.3 | 78.0 | Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions |
| VideoCoCa (zero-shot) | 34.3 | 57.8 | 67.0 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners |
| MDMMT-2 | 33.7 | 60.5 | 70.8 | MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization |
| VIOLET + MELTR | 33.6 | 63.7 | 77.8 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models |