Video Captioning on MSVD
Metrics
BLEU-4: modified n-gram precision up to 4-grams, with a brevity penalty
CIDEr: consensus-based similarity to reference captions using TF-IDF-weighted n-grams
METEOR: unigram matching with stemming and synonym support, weighted toward recall
ROUGE-L: F-measure based on the longest common subsequence with the reference
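Leaderboard scores are typically produced with the standard COCO caption evaluation toolkit; as an illustration of what BLEU-4 measures, below is a minimal, self-contained sketch of sentence-level BLEU-4 (clipped n-gram precision plus brevity penalty). The function name and example captions are illustrative, not taken from any listed model.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, references):
    """Sentence-level BLEU-4: geometric mean of clipped 1..4-gram
    precisions, multiplied by a brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in range(1, 5):
        cand_counts = ngrams(cand, n)
        if not cand_counts:          # candidate shorter than n tokens
            return 0.0
        # clip each candidate n-gram count by its max count in any reference
        max_ref = Counter()
        for ref in refs:
            for gram, cnt in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
        precisions.append(clipped / sum(cand_counts.values()))
    if min(precisions) == 0:
        return 0.0
    # brevity penalty against the reference length closest to the candidate's
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

# An exact match scores 1.0; a paraphrase that breaks the 4-grams scores lower.
print(bleu4("a man is playing a guitar", ["a man is playing a guitar"]))
print(bleu4("a man plays guitar", ["a man is playing a guitar"]))
```

Reported BLEU-4 values (e.g. 56.3) are corpus-level scores multiplied by 100; the toolkit also applies tokenization and smoothing choices that this sketch omits.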
Results
Performance of various models on the MSVD video captioning benchmark (higher is better for all metrics).
Comparison Table
Model | BLEU-4 | CIDEr | METEOR | ROUGE-L |
---|---|---|---|---|
icocap-improving-video-captioning-by | 56.3 | 103.8 | 38.9 | 75.0 |
rtq-rethinking-video-language-understanding | 66.9 | 123.4 | - | 82.2 |
valor-vision-audio-language-omni-perception | 80.7 | 178.5 | 51.0 | 87.9 |
cosa-concatenated-sample-pretrained-vision | 76.5 | 178.5 | - | - |
mplug-2-a-modularized-multi-modal-foundation | 70.5 | 165.8 | 48.4 | 85.3 |
accurate-and-fast-compressed-video-captioning | 60.1 | 121.5 | 41.4 | 78.2 |
mammut-a-simple-architecture-for-joint | - | 195.6 | - | - |
diverse-video-captioning-by-adaptive-spatio | 59.2 | 119.7 | 40.65 | 76.7 |
icocap-improving-video-captioning-by | 59.1 | 110.3 | 39.5 | 76.5 |
diverse-video-captioning-by-adaptive-spatio | 56.1 | 106.4 | 39.1 | 74.5 |
an-empirical-study-of-end-to-end-video | - | 139.2 | - | - |
vid2seq-large-scale-pretraining-of-a-visual | - | 146.2 | 45.3 | - |
vlab-enhancing-video-language-pre-training-by | 79.3 | 179.8 | 51.2 | 87.9 |
howtocaption-prompting-llms-to-transform | 70.4 | 154.2 | 46.4 | 83.2 |
sem-pos-grammatically-and-semantically | 60.1 | 108.3 | 38.5 | 76.0 |
hitea-hierarchical-temporal-aware-video | 71.0 | 146.9 | 45.3 | 81.4 |