HyperAI

Audio Captioning on AudioCaps

Metrics

CIDEr
METEOR
SPICE
SPIDEr
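
Of the four metrics, SPIDEr is a derived score, conventionally defined as the arithmetic mean of CIDEr and SPICE. The sketch below assumes that definition; the function name `spider` is illustrative and does not come from any evaluation toolkit. It recomputes SPIDEr from the CIDEr and SPICE values listed for SLAM-AAC in the results.

```python
def spider(cider: float, spice: float) -> float:
    """Return SPIDEr, assumed here to be the mean of CIDEr and SPICE."""
    return (cider + spice) / 2.0


# CIDEr and SPICE values taken from the SLAM-AAC row of the results.
score = spider(0.841, 0.194)
print(round(score, 4))
```

The leaderboard lists 0.518 for this model, which agrees with the recomputed value after rounding to three decimals.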

Results

Performance results of various models on this benchmark

| Model Name | CIDEr | METEOR | SPICE | SPIDEr | Paper Title | Repository |
|---|---|---|---|---|---|---|
| EnCLAP-large | 0.8029 | 0.2554 | 0.1879 | 0.4954 | EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning | |
| EnCLAP++-large | 0.823 | 0.269 | 0.197 | 0.510 | EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance | |
| VAST | 0.781 | 0.247 | - | - | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | |
| AL-MixGen | 0.755 | - | 0.177 | 0.466 | Exploring Train and Test-Time Augmentations for Audio-Language Learning | - |
| SLAM-AAC | 0.841 | 0.268 | 0.194 | 0.518 | SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs | |
| LOAE | 0.816 | 0.267 | 0.193 | 0.505 | Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding | |
| TopDown-AlignedAtt (1NN) | 0.593 | - | 0.144 | 0.369 | AudioCaps: Generating Captions for Audios in The Wild | - |
| AL-MixGen + Multi-TTA | 0.769 | - | 0.181 | 0.475 | - | - |
| EnCLAP++-base | 0.815 | 0.257 | 0.188 | 0.501 | EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance | |
| CNext-trans | 0.8061 | 0.2527 | 0.1841 | 0.4951 | - | - |
| BART + YAMNet + PANNs | 0.753 | - | 0.176 | 0.465 | Automated Audio Captioning by Fine-Tuning BART with AudioSet Tags | |
| EnCLAP-base | 0.7795 | 0.2473 | 0.1863 | 0.4829 | EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning | |
| CNN+Transformer | 0.693 | - | 0.159 | 0.426 | Audio Captioning Transformer | |
| VALOR | 0.741 | 0.231 | - | - | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | |
| AutoCap | 0.832 | 0.253 | 0.182 | 0.507 | Taming Data and Transformers for Audio Generation | |
| Rethink-ACT (AST + TF + MIL) | 0.764 | 0.242 | 0.180 | 0.472 | Rethinking Transfer and Auxiliary Learning for Improving Audio Captioning Transformer | - |
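
Because some entries do not report every metric, ranking the models takes a little care. The following is a minimal sketch, not part of the benchmark itself. It copies four rows from the results, marks a missing SPIDEr with `None`, and sorts the remaining rows in descending order. The helper name `rank_by_spider` and the tuple layout are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# (model, CIDEr, SPIDEr); None marks a score the leaderboard does not report.
Row = Tuple[str, float, Optional[float]]

rows: List[Row] = [
    ("SLAM-AAC", 0.841, 0.518),
    ("AutoCap", 0.832, 0.507),
    ("EnCLAP++-large", 0.823, 0.510),
    ("VAST", 0.781, None),
]


def rank_by_spider(entries: List[Row]) -> List[Row]:
    """Drop rows without a SPIDEr score, then sort the rest high to low."""
    scored = [row for row in entries if row[2] is not None]
    return sorted(scored, key=lambda row: row[2], reverse=True)


ranked = rank_by_spider(rows)
for model, cider, spider_score in ranked:
    print(f"{model}: SPIDEr={spider_score:.3f}, CIDEr={cider:.3f}")
```

Excluding rows with a missing score, rather than treating the gap as zero, keeps models such as VAST from being ranked on a metric they were never evaluated on.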