
THE SJTU SYSTEM FOR DCASE2021 CHALLENGE TASK 6: AUDIO CAPTIONING BASED ON ENCODER PRE-TRAINING AND REINFORCEMENT LEARNING

Kai Yu, Mengyue Wu, Zeyu Xie, Xuenan Xu
Abstract

This report proposes an audio captioning system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge Task 6. Our audio captioning system consists of a 10-layer convolutional neural network (CNN) encoder and a single-layer gated recurrent unit (GRU) decoder with temporal attention. In this challenge, there is no restriction on the usage of external data and pre-trained models. To better model the concepts in an audio clip, we pre-train the CNN encoder with audio tagging on AudioSet. After standard cross-entropy-based training, we further fine-tune the model with reinforcement learning to directly optimize the evaluation metric. Experiments show that our proposed system achieves a SPIDEr of 28.6 on the public evaluation split without ensembling.
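
The reinforcement learning step can be illustrated with a small sketch. The abstract does not name the algorithm, so this assumes a REINFORCE-style policy gradient with a baseline reward (for example, the score of the greedily decoded caption, as in self-critical training). The function name `policy_gradient_loss` and its parameters are hypothetical and are not taken from the report. It shows how a sentence-level metric score could be turned into a training loss for one sampled caption.

```python
from typing import Sequence


def policy_gradient_loss(
    token_log_probs: Sequence[float],
    sampled_reward: float,
    baseline_reward: float,
) -> float:
    """REINFORCE-style loss for a single sampled caption (illustrative only).

    token_log_probs: log-probability the decoder assigned to each sampled token.
    sampled_reward:  metric score of the sampled caption (e.g. SPIDEr or CIDEr).
    baseline_reward: score of a reference decode; assumed here to be the
                     greedy caption, which the report may or may not use.
    """
    # Advantage is positive when the sampled caption beats the baseline.
    advantage = sampled_reward - baseline_reward
    # Minimizing this loss raises the likelihood of captions that score
    # above the baseline and lowers it for captions that score below.
    return -advantage * sum(token_log_probs)


if __name__ == "__main__":
    # A two-token sampled caption that scores 0.6 against a baseline of 0.4.
    loss = policy_gradient_loss([-0.5, -1.0], sampled_reward=0.6, baseline_reward=0.4)
    print(f"loss = {loss:.3f}")
```

In a real system the log-probabilities would come from the GRU decoder and the rewards from the evaluation toolkit. The loss would then be averaged over a batch and backpropagated through the decoder, and optionally the encoder.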