Audio Captioning on AudioCaps
Metrics
CIDEr
METEOR
SPICE
SPIDEr
Results
Performance results of various models on this benchmark
| Model name | CIDEr | METEOR | SPICE | SPIDEr | Paper Title | Repository |
|---|---|---|---|---|---|---|
| EnCLAP-large | 0.8029 | 0.2554 | 0.1879 | 0.4954 | EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning | |
| EnCLAP++-large | 0.823 | 0.269 | 0.197 | 0.510 | EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance | |
| VAST | 0.781 | 0.247 | - | - | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | |
| AL-MixGen | 0.755 | - | 0.177 | 0.466 | Exploring Train and Test-Time Augmentations for Audio-Language Learning | - |
| SLAM-AAC | 0.841 | 0.268 | 0.194 | 0.518 | SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs | |
| LOAE | 0.816 | 0.267 | 0.193 | 0.505 | Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding | |
| TopDown-AlignedAtt (1NN) | 0.593 | - | 0.144 | 0.369 | AudioCaps: Generating Captions for Audios in The Wild | - |
| AL-MixGen + Multi-TTA | 0.769 | - | 0.181 | 0.475 | - | - |
| EnCLAP++-base | 0.815 | 0.257 | 0.188 | 0.501 | EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance | |
| CNext-trans | 0.8061 | 0.2527 | 0.1841 | 0.4951 | - | - |
| BART + YAMNet + PANNs | 0.753 | - | 0.176 | 0.465 | AUTOMATED AUDIO CAPTIONING BY FINE-TUNING BART WITH AUDIOSET TAGS | |
| EnCLAP-base | 0.7795 | 0.2473 | 0.1863 | 0.4829 | EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning | |
| CNN+Transformer | 0.693 | - | 0.159 | 0.426 | Audio Captioning Transformer | |
| VALOR | 0.741 | 0.231 | - | - | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | |
| AutoCap | 0.832 | 0.253 | 0.182 | 0.507 | Taming Data and Transformers for Audio Generation | |
| Rethink-ACT (AST + TF + MIL) | 0.764 | 0.242 | 0.180 | 0.472 | Rethinking Transfer and Auxiliary Learning for Improving Audio Captioning Transformer | - |
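The SPIDEr column can be cross-checked against the CIDEr and SPICE columns. The page itself does not define the metrics; the sketch below assumes the conventional definition of SPIDEr as the arithmetic mean of CIDEr and SPICE. The function names `spider` and `max_deviation` are hypothetical helpers introduced here for illustration, and the three rows are copied from the results listed above.

```python
def spider(cider: float, spice: float) -> float:
    """Assumed definition: SPIDEr is the arithmetic mean of CIDEr and SPICE."""
    return (cider + spice) / 2


# (CIDEr, SPICE, reported SPIDEr) copied from three rows of the results above.
rows = {
    "EnCLAP-large": (0.8029, 0.1879, 0.4954),
    "SLAM-AAC": (0.841, 0.194, 0.518),
    "AL-MixGen": (0.755, 0.177, 0.466),
}


def max_deviation(table: dict) -> float:
    """Largest absolute gap between the computed and the reported SPIDEr."""
    return max(abs(spider(c, s) - reported) for c, s, reported in table.values())


if __name__ == "__main__":
    for name, (c, s, reported) in rows.items():
        print(f"{name}: computed {spider(c, s):.4f}, reported {reported}")
```

Gaps of up to about 0.0005 are expected where the leaderboard reports scores rounded to three decimals. Rows that list "-" for CIDEr or SPICE cannot be checked this way.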