THE DCASE 2021 CHALLENGE TASK 6 SYSTEM: AUTOMATED AUDIO CAPTIONING WITH WEAKLY SUPERVISED PRE-TRAINING AND WORD SELECTION METHODS

This technical report describes our system submitted to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge, Task 6: automated audio captioning. We use an encoder-decoder modeling framework for audio understanding and caption generation. Our solution focuses on two problems in automated audio captioning: data insufficiency and word selection indeterminacy. Because few audio clips with gold-standard captions are available, we collect a large-scale weakly labeled dataset from the Web using heuristic methods. We then pre-train the encoder-decoder models on this dataset and fine-tune them on the Clotho dataset. To address the word selection indeterminacy problem, we use keywords extracted from the captions of similar audio clips, together with audio event tags produced by pre-trained models, to guide word generation in the decoding stage. We tested our submissions on the development-testing dataset. Our best submission achieved a SPIDEr score of 31.8, compared with 5.4 for the baseline system.
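As a rough illustration of how keywords might guide word generation during decoding, the sketch below adds a fixed bonus to the log-probabilities of keyword tokens before a word is picked. The abstract does not say how the guidance is applied, so this additive biasing scheme, the function names `bias_log_probs` and `pick_word`, the `bonus` parameter, and the toy vocabulary are all assumptions for illustration, not the system's actual method.

```python
import math

def bias_log_probs(log_probs, keywords, bonus=1.0):
    """Add `bonus` to the log-probability of every keyword, then renormalize.

    log_probs: dict mapping candidate word -> log-probability at one decoding step.
    keywords:  set of words to favor (hypothetical: taken from captions of
               similar clips or from audio event tags).
    """
    boosted = {w: lp + (bonus if w in keywords else 0.0)
               for w, lp in log_probs.items()}
    # Renormalize so the biased scores form a valid distribution again.
    log_z = math.log(sum(math.exp(lp) for lp in boosted.values()))
    return {w: lp - log_z for w, lp in boosted.items()}

def pick_word(log_probs, keywords, bonus=1.0):
    """Greedy choice of the next word after keyword biasing."""
    biased = bias_log_probs(log_probs, keywords, bonus)
    return max(biased, key=biased.get)

# Toy decoder output for one step (made-up probabilities).
step = {
    "bird": math.log(0.40),
    "sparrow": math.log(0.35),
    "car": math.log(0.25),
}

print(pick_word(step, keywords=set()))        # no guidance
print(pick_word(step, keywords={"sparrow"}))  # guidance toward a keyword
```

In this toy case the unguided decoder prefers the more generic word, while the keyword bonus tips the choice toward the more specific one. In a beam search the same biased scores would be accumulated per hypothesis instead of being used for a greedy pick.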