HyperAI

Video Captioning on MSR-VTT

Metrics

BLEU-4
CIDEr
METEOR
ROUGE-L

Results

Performance results of various models on this benchmark

| Model name | BLEU-4 | CIDEr | METEOR | ROUGE-L | Paper Title | Repository |
|---|---|---|---|---|---|---|
| HowToCaption | 49.8 | 65.3 | 32.2 | 66.3 | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | |
| mPLUG-2 | 57.8 | 80.0 | 34.9 | 70.1 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | |
| Vid2Seq | - | 64.6 | 30.8 | - | Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | |
| VideoCoCa | 53.8 | 73.2 | - | 68.0 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | - |
| UniVL + MELTR | 44.17 | 52.77 | 29.26 | 62.35 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | |
| VAST | 56.7 | 78.0 | - | - | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | |
| HiTeA | 49.2 | 65.1 | 30.7 | 65.0 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - |
| TextKG | 46.6 | 60.8 | 30.5 | 64.8 | Text with Knowledge Graph Augmented Transformer for Video Captioning | - |
| COSA | 53.7 | 74.7 | - | - | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | |
| VIOLETv2 | - | 58 | - | - | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | |
| CoCap (ViT/L14) | 44.4 | 57.2 | 30.3 | 63.4 | Accurate and Fast Compressed Video Captioning | |
| IcoCap (ViT-B/16) | 47.0 | 60.2 | 31.1 | 64.9 | IcoCap: Improving Video Captioning by Compounding Images | - |
| VLAB | 54.6 | 74.9 | 33.4 | 68.3 | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | - |
| SEM-POS | 45.2 | 53.1 | 30.7 | 64.1 | SEM-POS: Grammatically and Semantically Correct Video Captioning | - |
| CLIP-DCD | 48.2 | 58.7 | 31.3 | 64.8 | CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter | |
| IcoCap (ViT-B/32) | 46.1 | 59.1 | 30.3 | 64.3 | IcoCap: Improving Video Captioning by Compounding Images | - |
| GIT2 | 54.8 | 75.9 | 33.1 | 68.2 | GIT: A Generative Image-to-text Transformer for Vision and Language | |
| MaMMUT (ours) | - | 73.6 | - | - | MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | |
| RTQ | 49.6 | 69.3 | - | 66.1 | RTQ: Rethinking Video-language Understanding Based on Image-text Model | |
| EMCL-Net | 45.3 | 54.6 | 30.2 | 63.2 | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | |