Video Captioning on MSR-VTT
Metrics: BLEU-4, CIDEr, METEOR, ROUGE-L
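The leaderboard below reports these four text-similarity metrics between generated and reference captions. Below is a minimal sketch of how they are commonly computed, assuming the pycocoevalcap toolkit (the COCO caption-evaluation package widely used to produce such numbers); the video ID and captions are illustrative, and leaderboard scores are conventionally the raw scores scaled by 100.

```python
# Minimal sketch of computing the leaderboard metrics with pycocoevalcap.
# The video ID and captions are illustrative, not from MSR-VTT itself.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

# References and candidates keyed by video ID; each value is a list of
# already-tokenized, lower-cased caption strings. MSR-VTT provides 20
# human references per clip; one is shown here for brevity.
gts = {"video0": ["a man is playing a guitar on stage"]}
res = {"video0": ["a man is playing guitar"]}

bleu, _ = Bleu(4).compute_score(gts, res)     # list [BLEU-1, ..., BLEU-4]
cider, _ = Cider().compute_score(gts, res)    # corpus-level CIDEr
rouge_l, _ = Rouge().compute_score(gts, res)  # corpus-level ROUGE-L

# Leaderboard entries are typically these values multiplied by 100.
print(f"BLEU-4: {bleu[3]:.3f}  CIDEr: {cider:.3f}  ROUGE-L: {rouge_l:.3f}")
# METEOR (pycocoevalcap.meteor.meteor.Meteor) follows the same
# compute_score interface but requires a Java runtime, so it is
# omitted from this sketch.
```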
Results
Performance results of the various models on this benchmark.
| Model | BLEU-4 | CIDEr | METEOR | ROUGE-L | Paper |
| --- | --- | --- | --- | --- | --- |
| HowToCaption | 49.8 | 65.3 | 32.2 | 66.3 | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale |
| mPLUG-2 | 57.8 | 80.0 | 34.9 | 70.1 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| Vid2Seq | - | 64.6 | 30.8 | - | Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning |
| VideoCoCa | 53.8 | 73.2 | - | 68.0 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners |
| UniVL + MELTR | 44.17 | 52.77 | 29.26 | 62.35 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models |
| VAST | 56.7 | 78.0 | - | - | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| HiTeA | 49.2 | 65.1 | 30.7 | 65.0 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| TextKG | 46.6 | 60.8 | 30.5 | 64.8 | Text with Knowledge Graph Augmented Transformer for Video Captioning |
| COSA | 53.7 | 74.7 | - | - | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model |
| VIOLETv2 | - | 58 | - | - | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling |
| CoCap (ViT/L14) | 44.4 | 57.2 | 30.3 | 63.4 | Accurate and Fast Compressed Video Captioning |
| IcoCap (ViT-B/16) | 47.0 | 60.2 | 31.1 | 64.9 | IcoCap: Improving Video Captioning by Compounding Images |
| VLAB | 54.6 | 74.9 | 33.4 | 68.3 | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending |
| SEM-POS | 45.2 | 53.1 | 30.7 | 64.1 | SEM-POS: Grammatically and Semantically Correct Video Captioning |
| CLIP-DCD | 48.2 | 58.7 | 31.3 | 64.8 | CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter |
| IcoCap (ViT-B/32) | 46.1 | 59.1 | 30.3 | 64.3 | IcoCap: Improving Video Captioning by Compounding Images |
| GIT2 | 54.8 | 75.9 | 33.1 | 68.2 | GIT: A Generative Image-to-text Transformer for Vision and Language |
| MaMMUT | - | 73.6 | - | - | MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks |
| RTQ | 49.6 | 69.3 | - | 66.1 | RTQ: Rethinking Video-language Understanding Based on Image-text Model |
| EMCL-Net | 45.3 | 54.6 | 30.2 | 63.2 | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations |

A dash (-) marks a metric the corresponding paper did not report.
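For readers who want to manipulate these results programmatically, here is a minimal sketch, assuming pandas is available, that re-ranks a few of the rows above by CIDEr; the tuples are copied from the table, with None standing in for unreported metrics.

```python
# Minimal sketch: load a few leaderboard rows and sort by CIDEr.
# Values are copied from the table above; None = not reported.
import pandas as pd

rows = [
    ("mPLUG-2", 57.8, 80.0, 34.9, 70.1),
    ("VAST", 56.7, 78.0, None, None),
    ("GIT2", 54.8, 75.9, 33.1, 68.2),
    ("VLAB", 54.6, 74.9, 33.4, 68.3),
    ("COSA", 53.7, 74.7, None, None),
]

df = pd.DataFrame(rows, columns=["Model", "BLEU-4", "CIDEr", "METEOR", "ROUGE-L"])
print(df.sort_values("CIDEr", ascending=False).to_string(index=False))
```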