HyperAI초신경

Video Captioning on MSVD 1

Evaluation Metrics

BLEU-4

CIDEr

METEOR

ROUGE-L
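As an illustration of one of the metrics listed above, ROUGE-L scores a candidate caption by the longest common subsequence (LCS) it shares with a reference caption. The following is a minimal single-reference sketch, not the scoring code behind this leaderboard: the function names `lcs_length` and `rouge_l` are hypothetical, tokenization is plain whitespace splitting, and the default `beta=1.2` is an assumption borrowed from common captioning evaluation toolkits. Published scores are normally computed over a full test set with multiple references per video, so values from this sketch are not directly comparable to the table.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """Sentence-level ROUGE-L F-score against a single reference.

    beta weights recall relative to precision (assumed value, see lead-in).
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

An identical candidate and reference give a score of 1.0, and captions with no words in common give 0.0.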

Evaluation Results

Performance of each model on this benchmark

Model Name	BLEU-4	CIDEr	METEOR	ROUGE-L	Paper Title	Repository
IcoCap (ViT-B/32)	56.3	103.8	38.9	75.0	IcoCap: Improving Video Captioning by Compounding Images	-
RTQ	66.9	123.4	-	82.2	RTQ: Rethinking Video-language Understanding Based on Image-text Model
VALOR	80.7	178.5	51.0	87.9	VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
COSA	76.5	178.5	-	-	COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
mPLUG-2	70.5	165.8	48.4	85.3	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
CoCap (ViT/L14)	60.1	121.5	41.4	78.2	Accurate and Fast Compressed Video Captioning
MaMMUT	-	195.6	-	-	MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
VASTA (Vatex-backbone)	59.2	119.7	40.65	76.7	Diverse Video Captioning by Adaptive Spatio-temporal Attention
IcoCap (ViT-B/16)	59.1	110.3	39.5	76.5	IcoCap: Improving Video Captioning by Compounding Images	-
VASTA (Kinetics-backbone)	56.1	106.4	39.1	74.5	Diverse Video Captioning by Adaptive Spatio-temporal Attention
VIOLETv2	-	139.2	-	-	An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
Vid2Seq	-	146.2	45.3	-	Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
VLAB	79.3	179.8	51.2	87.9	VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending	-
HowToCaption	70.4	154.2	46.4	83.2	HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
SEM-POS	60.1	108.3	38.5	76.0	SEM-POS: Grammatically and Semantically Correct Video Captioning	-
HiTeA	71.0	146.9	45.3	81.4	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	-
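Because several models report only a subset of the four metrics (missing values appear as "-"), ranking the table by one metric requires skipping the rows that lack it. The sketch below shows one way to do that on a handful of rows copied from the table above; `ROWS`, `parse_score`, and `rank_by` are hypothetical names introduced only for this example.

```python
# A few rows copied from the table above:
# (model, BLEU-4, CIDEr, METEOR, ROUGE-L); "-" marks a score that is not reported.
ROWS = [
    ("VALOR", "80.7", "178.5", "51.0", "87.9"),
    ("MaMMUT", "-", "195.6", "-", "-"),
    ("VLAB", "79.3", "179.8", "51.2", "87.9"),
    ("Vid2Seq", "-", "146.2", "45.3", "-"),
    ("RTQ", "66.9", "123.4", "-", "82.2"),
]
METRICS = ("BLEU-4", "CIDEr", "METEOR", "ROUGE-L")


def parse_score(cell):
    """Convert a table cell to a float, or None when the score is missing."""
    cell = cell.strip()
    return None if cell == "-" else float(cell)


def rank_by(metric, rows=ROWS):
    """Return (model, score) pairs sorted from best to worst on one metric.

    Rows that do not report the metric are dropped rather than treated as zero.
    """
    col = METRICS.index(metric) + 1  # column 0 holds the model name
    scored = [(row[0], parse_score(row[col])) for row in rows]
    reported = [(name, score) for name, score in scored if score is not None]
    return sorted(reported, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_by("CIDEr"):
        print(f"{name}\t{score}")
```

Dropping rows with missing scores, instead of sorting them as zero, keeps a model such as MaMMUT out of the BLEU-4 ranking without implying it scored poorly there.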
