Rethinking Transfer and Auxiliary Learning for Improving Audio Captioning Transformer
Sung Won Han, Seungjin Lee, Dongwon Kim, Jin Sob Kim, Hyun Joon Park, WooSeok Shin
Abstract
The performance of automated audio captioning (AAC) has improved considerably through transformer-based encoders and transfer learning. However, further gains are constrained by two problems: (1) a discrepancy in input patch size between the pretraining and fine-tuning steps, and (2) a lack of local-level relations between inputs and captions. In this paper, we propose a simple transfer learning scheme that, unlike previous methods, maintains the input patch size to avoid input discrepancies. Furthermore, we propose a patch-wise keyword estimation branch that utilizes an attention pooling method to effectively represent both global- and local-level information. Results on the AudioCaps dataset reveal that the proposed learning scheme and method contribute considerably to the performance gain. Finally, visualization results demonstrate that the proposed attention pooling method effectively detects local-level information in the AAC system.
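To make the abstract's "attention pooling" idea concrete, below is a minimal sketch of how a patch-wise keyword estimation branch could pool patch embeddings with learned attention weights and predict multi-label keyword logits. This is not the authors' implementation; all names and dimensions (PatchKeywordBranch, embed_dim, num_keywords) are illustrative assumptions.

```python
# Sketch only: illustrates attention pooling over patch embeddings for
# keyword estimation; module/dimension names are hypothetical.
import torch
import torch.nn as nn


class PatchKeywordBranch(nn.Module):
    """Pools patch-level encoder outputs with learned attention weights,
    then predicts multi-label keyword logits from the pooled vector."""

    def __init__(self, embed_dim: int = 768, num_keywords: int = 300):
        super().__init__()
        self.attn_score = nn.Linear(embed_dim, 1)        # one scalar score per patch
        self.classifier = nn.Linear(embed_dim, num_keywords)

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (batch, num_patches, embed_dim) from the audio encoder
        weights = torch.softmax(self.attn_score(patch_embeds), dim=1)  # (B, P, 1)
        pooled = (weights * patch_embeds).sum(dim=1)                   # (B, D)
        return self.classifier(pooled)                                 # keyword logits


if __name__ == "__main__":
    branch = PatchKeywordBranch()
    x = torch.randn(2, 196, 768)   # dummy patch embeddings
    print(branch(x).shape)         # torch.Size([2, 300])
```

The softmax-weighted sum lets the branch emphasize the patches most relevant to each clip's keywords, which is one way a model could capture the local-level information the abstract refers to.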