Video Captioning on MSR-VTT

Metrics

BLEU-4

CIDEr

METEOR

ROUGE-L

Results

Performance results of various models on this benchmark

Model name	BLEU-4	CIDEr	METEOR	ROUGE-L	Paper Title	Repository
HowToCaption	49.8	65.3	32.2	66.3	HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
mPLUG-2	57.8	80.0	34.9	70.1	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
Vid2Seq	-	64.6	30.8	-	Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
VideoCoCa	53.8	73.2	-	68.0	VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners	-
UniVL + MELTR	44.17	52.77	29.26	62.35	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models
VAST	56.7	78.0	-	-	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
HiTeA	49.2	65.1	30.7	65.0	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	-
TextKG	46.6	60.8	30.5	64.8	Text with Knowledge Graph Augmented Transformer for Video Captioning	-
COSA	53.7	74.7	-	-	COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
VIOLETv2	-	58	-	-	An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
CoCap (ViT/L14)	44.4	57.2	30.3	63.4	Accurate and Fast Compressed Video Captioning
IcoCap (ViT-B/16)	47.0	60.2	31.1	64.9	IcoCap: Improving Video Captioning by Compounding Images	-
VLAB	54.6	74.9	33.4	68.3	VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending	-
SEM-POS	45.2	53.1	30.7	64.1	SEM-POS: Grammatically and Semantically Correct Video Captioning	-
CLIP-DCD	48.2	58.7	31.3	64.8	CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter
IcoCap (ViT-B/32)	46.1	59.1	30.3	64.3	IcoCap: Improving Video Captioning by Compounding Images	-
GIT2	54.8	75.9	33.1	68.2	GIT: A Generative Image-to-text Transformer for Vision and Language
MaMMUT (ours)	-	73.6	-	-	MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
RTQ	49.6	69.3	-	66.1	RTQ: Rethinking Video-language Understanding Based on Image-text Model
EMCL-Net	45.3	54.6	30.2	63.2	Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
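
Because several entries report only some of the four metrics (a `-` marks a score that was not reported), ranking the table by any single metric requires skipping the missing cells. The following is a minimal sketch of one way to do that in Python, using a handful of rows copied from the table above. The helper names (`parse_score`, `parse_rows`, `rank_by`) are hypothetical and chosen for illustration only; they are not part of any benchmark tooling.

```python
from typing import Optional

# A few rows copied from the table above: model, BLEU-4, CIDEr, METEOR, ROUGE-L.
# Cells are tab-separated; "-" means the score was not reported.
ROWS = """\
mPLUG-2\t57.8\t80.0\t34.9\t70.1
VAST\t56.7\t78.0\t-\t-
GIT2\t54.8\t75.9\t33.1\t68.2
Vid2Seq\t-\t64.6\t30.8\t-
VIOLETv2\t-\t58\t-\t-
"""

METRICS = ["BLEU-4", "CIDEr", "METEOR", "ROUGE-L"]


def parse_score(cell: str) -> Optional[float]:
    """Return the score as a float, or None when the cell is '-' (not reported)."""
    cell = cell.strip()
    return None if cell == "-" else float(cell)


def parse_rows(text: str) -> list[dict]:
    """Parse tab-separated rows into dicts keyed by 'model' and metric name."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        name, *cells = line.split("\t")
        scores = {m: parse_score(c) for m, c in zip(METRICS, cells)}
        rows.append({"model": name, **scores})
    return rows


def rank_by(rows: list[dict], metric: str) -> list[str]:
    """Model names sorted by descending score; models lacking the metric are skipped."""
    reported = [r for r in rows if r[metric] is not None]
    ranked = sorted(reported, key=lambda r: r[metric], reverse=True)
    return [r["model"] for r in ranked]


if __name__ == "__main__":
    table = parse_rows(ROWS)
    print(rank_by(table, "CIDEr"))
    print(rank_by(table, "METEOR"))
```

Skipping unreported scores, rather than treating them as zero, avoids ranking a model last on a metric it was never evaluated on.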
