Audio Captioning
Audio captioning is the task of generating a natural-language description of an audio clip. A typical system uses an audio encoder to extract features from the signal and a decoder to generate a description from those features, with the goal of describing the audio content accurately. The task has practical value in multimedia content analysis and in helping people who are deaf or hard of hearing access audio information. However, evaluation metrics borrowed from machine translation and image captioning are not entirely suitable for judging the quality of audio captions, so researchers are exploring newer evaluation methods, such as metrics based on pre-trained language models like Sentence-BERT.
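The idea behind embedding-based evaluation can be sketched as follows: embed the generated caption and each reference caption as vectors, then score the caption by cosine similarity. This is a minimal illustration, not a published metric. The names `cosine_similarity`, `embedding_score`, `toy_embed`, and `VOCAB` are hypothetical, averaging over references is just one possible choice, and `toy_embed` is a word-count stand-in used only so the example runs without a model; in practice the `embed` argument would be a pre-trained sentence encoder such as Sentence-BERT.

```python
import math
from collections import Counter
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def embedding_score(
    candidate: str,
    references: Sequence[str],
    embed: Callable[[str], Vector],
) -> float:
    """Mean cosine similarity between the candidate caption and each reference caption."""
    if not references:
        raise ValueError("at least one reference caption is required")
    candidate_vec = embed(candidate)
    similarities = [cosine_similarity(candidate_vec, embed(ref)) for ref in references]
    return sum(similarities) / len(similarities)


# Assumption: a toy word-count embedding over a tiny fixed vocabulary.
# It only makes the sketch runnable; it does NOT capture meaning the way
# a pre-trained sentence encoder would.
VOCAB = ("dog", "barks", "bird", "sings", "loudly")


def toy_embed(sentence: str) -> Vector:
    counts = Counter(sentence.lower().split())
    return [float(counts[word]) for word in VOCAB]


if __name__ == "__main__":
    references = ["a dog barks loudly"]
    print(embedding_score("a dog barks", references, toy_embed))
    print(embedding_score("a bird sings", references, toy_embed))
```

Because `embedding_score` accepts any callable that maps a sentence to a vector, the toy embedding can be swapped for a real sentence encoder without changing the scoring logic.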