HyperAI超神経

Video Captioning on MSVD 1

Evaluation Metrics

BLEU-4
CIDEr
METEOR
ROUGE-L

Evaluation Results

Performance of each model on this benchmark

| Model | BLEU-4 | CIDEr | METEOR | ROUGE-L | Paper Title | Repository |
| --- | --- | --- | --- | --- | --- | --- |
| IcoCap (ViT-B/32) | 56.3 | 103.8 | 38.9 | 75.0 | IcoCap: Improving Video Captioning by Compounding Images | - |
| RTQ | 66.9 | 123.4 | - | 82.2 | RTQ: Rethinking Video-language Understanding Based on Image-text Model | |
| VALOR | 80.7 | 178.5 | 51.0 | 87.9 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | |
| COSA | 76.5 | 178.5 | - | - | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | |
| mPLUG-2 | 70.5 | 165.8 | 48.4 | 85.3 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | |
| CoCap (ViT/L14) | 60.1 | 121.5 | 41.4 | 78.2 | Accurate and Fast Compressed Video Captioning | |
| MaMMUT | - | 195.6 | - | - | MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | |
| VASTA (Vatex-backbone) | 59.2 | 119.7 | 40.65 | 76.7 | Diverse Video Captioning by Adaptive Spatio-temporal Attention | |
| IcoCap (ViT-B/16) | 59.1 | 110.3 | 39.5 | 76.5 | IcoCap: Improving Video Captioning by Compounding Images | - |
| VASTA (Kinetics-backbone) | 56.1 | 106.4 | 39.1 | 74.5 | Diverse Video Captioning by Adaptive Spatio-temporal Attention | |
| VIOLETv2 | - | 139.2 | - | - | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | |
| Vid2Seq | - | 146.2 | 45.3 | - | Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | |
| VLAB | 79.3 | 179.8 | 51.2 | 87.9 | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | - |
| HowToCaption | 70.4 | 154.2 | 46.4 | 83.2 | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | |
| SEM-POS | 60.1 | 108.3 | 38.5 | 76.0 | SEM-POS: Grammatically and Semantically Correct Video Captioning | - |
| HiTeA | 71.0 | 146.9 | 45.3 | 81.4 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - |
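
The rows above are not sorted, and several models report only some of the four metrics, so comparing them takes a little care. The following is a minimal sketch of ranking models on one metric while skipping entries that do not report it. The scores are copied from a few of the rows above; the names `ROWS`, `METRICS`, and `rank_by` are hypothetical helpers introduced for illustration and are not part of any leaderboard API.

```python
from typing import Optional

# (model, BLEU-4, CIDEr, METEOR, ROUGE-L) for a subset of the leaderboard rows.
# None marks a score the leaderboard shows as "-".
ROWS: list[tuple[str, Optional[float], Optional[float], Optional[float], Optional[float]]] = [
    ("VALOR", 80.7, 178.5, 51.0, 87.9),
    ("COSA", 76.5, 178.5, None, None),
    ("MaMMUT", None, 195.6, None, None),
    ("Vid2Seq", None, 146.2, 45.3, None),
    ("VLAB", 79.3, 179.8, 51.2, 87.9),
]

# Column index of each metric inside a row tuple.
METRICS = {"BLEU-4": 1, "CIDEr": 2, "METEOR": 3, "ROUGE-L": 4}


def rank_by(metric: str) -> list[tuple[str, float]]:
    """Return (model, score) pairs sorted best-first, skipping models without that score."""
    col = METRICS[metric]
    scored = [(row[0], row[col]) for row in ROWS if row[col] is not None]
    # sorted() is stable, so tied models keep their original relative order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_by("CIDEr"):
        print(f"{name}: {score}")
```

Dropping missing scores rather than treating them as zero keeps a model that simply did not report a metric from appearing to rank last on it.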