6 months ago

Audio and Speech Processing

Audio Classification

Zhen Yang, Xiang Li, Dong Liu, Qichen Han∗, Weiqiang Yuan∗

Abstract

This technical report describes our system submitted to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge, Task 6: automated audio captioning. We use an encoder-decoder modeling framework for audio understanding and caption generation. Our solution focuses on two problems in automated audio captioning: data insufficiency and word selection indeterminacy. Because few audio clips with gold-standard captions are available, we collect a large-scale weakly labeled dataset from the Web using heuristic methods. We then pre-train the encoder-decoder models on this dataset and fine-tune them on the Clotho dataset. To address word selection indeterminacy, we use keywords extracted from the captions of similar audio clips, together with audio event tags produced by pre-trained models, to guide word generation in the decoding stage. We tested our submissions on the development-testing dataset. Our best submission achieved a SPIDEr score of 31.8, compared with 5.4 for the baseline system.
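The idea of guiding word generation with keywords can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the guidance is applied, so the additive logit bonus, the function name `bias_logits`, the `bonus` value, and the toy vocabulary below are all assumptions chosen only to show one simple way a decoder's next-word scores could be nudged toward a keyword set.

```python
import math


def bias_logits(logits, vocab, keywords, bonus=2.0):
    """Return a copy of `logits` with `bonus` added to every keyword token.

    Hypothetical helper: an additive bonus is one simple form of guidance,
    assumed here for illustration only.
    """
    keyword_set = set(keywords)
    return [
        score + bonus if token in keyword_set else score
        for token, score in zip(vocab, logits)
    ]


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary and decoder scores for a single decoding step (made up).
vocab = ["a", "dog", "barks", "bird", "sings"]
logits = [1.0, 0.5, 0.2, 0.6, 0.1]

# Keywords that might come from similar clips' captions or event tags.
keywords = ["dog", "barks"]

plain = softmax(logits)
guided = softmax(bias_logits(logits, vocab, keywords))

best_plain = vocab[plain.index(max(plain))]
best_guided = vocab[guided.index(max(guided))]
print(best_plain, best_guided)
```

In this toy step the unguided decoder prefers the generic word, while the biased scores favor a keyword. A real system would apply such guidance inside beam search or learn it as part of the decoder, which this sketch does not attempt.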

