TopDown-AlignedAtt (1NN) | 0.593 | - | 0.144 | 0.369 | AudioCaps: Generating Captions for Audios in The Wild | - |
BART + YAMNet + PANNs | 0.753 | - | 0.176 | 0.465 | Automated Audio Captioning by Fine-Tuning BART with AudioSet Tags | - |
CNN+Transformer | 0.693 | - | 0.159 | 0.426 | Audio Captioning Transformer | - |
Rethink-ACT (AST + TF + MIL) | 0.764 | 0.242 | 0.180 | 0.472 | Rethinking Transfer and Auxiliary Learning for Improving Audio Captioning Transformer | - |