HyperAI
Audio Captioning on AudioCaps
Metrics
CIDEr
METEOR
SPICE
SPIDEr
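SPIDEr is the arithmetic mean of SPICE and CIDEr, so it can be recomputed from the other two columns. The helper below is a minimal sketch for illustration (the function name `spider` is hypothetical, not part of any evaluation library). It reproduces the reported SPIDEr of the EnCLAP-large row in the results table (CIDEr 0.8029, SPICE 0.1879).

```python
def spider(cider: float, spice: float) -> float:
    """Hypothetical helper: SPIDEr as the arithmetic mean of CIDEr and SPICE."""
    return (cider + spice) / 2.0


# EnCLAP-large row: CIDEr 0.8029, SPICE 0.1879.
print(round(spider(0.8029, 0.1879), 4))  # → 0.4954
```

Because SPIDEr averages a consensus-based metric (CIDEr) with a semantic-graph metric (SPICE), a model cannot rank highly on it by optimizing only one of the two.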
Results
Performance of the various models on this benchmark:
| Model | CIDEr | METEOR | SPICE | SPIDEr | Paper |
|---|---|---|---|---|---|
| EnCLAP-large | 0.8029 | 0.2554 | 0.1879 | 0.4954 | EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning |
| EnCLAP++-large | 0.823 | 0.269 | 0.197 | 0.510 | EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance |
| VAST | 0.781 | 0.247 | - | - | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| AL-MixGen | 0.755 | - | 0.177 | 0.466 | Exploring Train and Test-Time Augmentations for Audio-Language Learning |
| SLAM-AAC | 0.841 | 0.268 | 0.194 | 0.518 | SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs |
| LOAE | 0.816 | 0.267 | 0.193 | 0.505 | Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding |
| TopDown-AlignedAtt (1NN) | 0.593 | - | 0.144 | 0.369 | AudioCaps: Generating Captions for Audios in The Wild |
| AL-MixGen + Multi-TTA | 0.769 | - | 0.181 | 0.475 | - |
| EnCLAP++-base | 0.815 | 0.257 | 0.188 | 0.501 | EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance |
| CNext-trans | 0.8061 | 0.2527 | 0.1841 | 0.4951 | - |
| BART + YAMNet + PANNs | 0.753 | - | 0.176 | 0.465 | Automated Audio Captioning by Fine-Tuning BART with AudioSet Tags |
| EnCLAP-base | 0.7795 | 0.2473 | 0.1863 | 0.4829 | EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning |
| CNN+Transformer | 0.693 | - | 0.159 | 0.426 | Audio Captioning Transformer |
| VALOR | 0.741 | 0.231 | - | - | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
| AutoCap | 0.832 | 0.253 | 0.182 | 0.507 | Taming Data and Transformers for Audio Generation |
| Rethink-ACT (AST + TF + MIL) | 0.764 | 0.242 | 0.180 | 0.472 | Rethinking Transfer and Auxiliary Learning for Improving Audio Captioning Transformer |

A dash (-) means the value was not reported.
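To compare entries programmatically, the table can be transcribed into Python and sorted by SPIDEr. This is a minimal sketch, assuming the rows are copied by hand (only CIDEr, SPICE, and SPIDEr are kept). `ROWS`, `rank_by_spider`, and `spider_is_consistent` are hypothetical names for illustration, not part of any HyperAI API. The consistency check uses a tolerance of 1e-3 to absorb the rounding of the published values.

```python
from typing import Optional

# Hypothetical transcription of the table: (model, CIDEr, SPICE, SPIDEr).
# None stands for a "-" (not reported) entry.
Row = tuple[str, float, Optional[float], Optional[float]]
ROWS: list[Row] = [
    ("EnCLAP-large", 0.8029, 0.1879, 0.4954),
    ("EnCLAP++-large", 0.823, 0.197, 0.510),
    ("VAST", 0.781, None, None),
    ("AL-MixGen", 0.755, 0.177, 0.466),
    ("SLAM-AAC", 0.841, 0.194, 0.518),
    ("LOAE", 0.816, 0.193, 0.505),
    ("TopDown-AlignedAtt (1NN)", 0.593, 0.144, 0.369),
    ("AL-MixGen + Multi-TTA", 0.769, 0.181, 0.475),
    ("EnCLAP++-base", 0.815, 0.188, 0.501),
    ("CNext-trans", 0.8061, 0.1841, 0.4951),
    ("BART + YAMNet + PANNs", 0.753, 0.176, 0.465),
    ("EnCLAP-base", 0.7795, 0.1863, 0.4829),
    ("CNN+Transformer", 0.693, 0.159, 0.426),
    ("VALOR", 0.741, None, None),
    ("AutoCap", 0.832, 0.182, 0.507),
    ("Rethink-ACT (AST + TF + MIL)", 0.764, 0.180, 0.472),
]


def rank_by_spider(rows: list[Row]) -> list[Row]:
    """Return the rows that report SPIDEr, best first; '-' rows are skipped."""
    scored = [r for r in rows if r[3] is not None]
    return sorted(scored, key=lambda r: r[3], reverse=True)


def spider_is_consistent(row: Row, tol: float = 1e-3) -> bool:
    """Check the reported SPIDEr against (CIDEr + SPICE) / 2.

    The tolerance covers rounding of the published values to 3-4 decimals.
    """
    _, cider, spice, spider = row
    if spice is None or spider is None:
        return True  # nothing to check for rows with missing values
    return abs((cider + spice) / 2 - spider) <= tol


if __name__ == "__main__":
    for name, _, _, spider in rank_by_spider(ROWS)[:3]:
        print(f"{name}: SPIDEr={spider:.3f}")
    print(all(spider_is_consistent(r) for r in ROWS))
```

Run as a script, this prints SLAM-AAC (0.518), EnCLAP++-large (0.510), and AutoCap (0.507), followed by `True`. The ranking depends on the metric. SLAM-AAC leads on CIDEr and SPIDEr, but EnCLAP++-large has the highest METEOR (0.269) and SPICE (0.197).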