
Boosting Audio-visual Zero-shot Learning with Large Language Models

Chen, Haoxing ; Li, Yaohui ; Hong, Yan ; Huang, Zizheng ; Xu, Zhuoer ; Gu, Zhangxuan ; Lan, Jun ; Zhu, Huijia ; Wang, Weiqiang
Abstract

Audio-visual zero-shot learning aims to recognize unseen classes based on paired audio-visual sequences. Recent methods mainly focus on learning multi-modal features aligned with class names to enhance the generalization ability to unseen categories. However, these approaches ignore the obscure event concepts in class names and may inevitably introduce complex network structures with difficult training objectives. In this paper, we introduce a straightforward yet efficient framework called KnowleDge-Augmented audio-visual learning (KDA), which aids the model in more effectively learning novel event content by leveraging an external knowledge base. Specifically, we first propose to utilize the knowledge contained in large language models (LLMs) to generate numerous descriptive sentences that include important distinguishing audio-visual features of event classes, which helps to better understand unseen categories. Furthermore, we propose a knowledge-aware adaptive margin loss to help distinguish similar events, further improving the generalization ability towards unseen classes. Extensive experimental results demonstrate that our proposed KDA can outperform state-of-the-art methods on three popular audio-visual zero-shot learning datasets. Our code will be available at https://github.com/chenhaoxing/KDA.
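
The abstract does not give the form of the knowledge-aware adaptive margin loss, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes that each class has a text embedding (for example, obtained by encoding LLM-generated descriptions) and that the margin between the target class and another class grows with the similarity of their text embeddings, so that similar events are pushed further apart. The function names, the hinge form, and the `base_margin` and `scale` parameters are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def adaptive_margin_loss(av_embedding, class_embeddings, target,
                         base_margin=0.1, scale=0.5):
    """Hinge loss with a per-class margin (hypothetical formulation).

    av_embedding:     audio-visual embedding of one sample
    class_embeddings: one text embedding per class (assumed to come from
                      encoded class descriptions)
    target:           index of the ground-truth class
    """
    target_sim = cosine(av_embedding, class_embeddings[target])
    total, count = 0.0, 0
    for j, emb in enumerate(class_embeddings):
        if j == target:
            continue
        # Assumption: classes whose descriptions resemble the target's
        # receive a larger margin than dissimilar classes.
        class_sim = max(0.0, cosine(class_embeddings[target], emb))
        margin = base_margin + scale * class_sim
        # Penalize when a wrong class scores within `margin` of the target.
        total += max(0.0, margin + cosine(av_embedding, emb) - target_sim)
        count += 1
    return total / count if count else 0.0


# Toy 2-D embeddings: class 2 is semantically close to class 0, class 1 is not.
classes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sample = [1.0, 0.0]
loss_adaptive = adaptive_margin_loss(sample, classes, target=0)
loss_fixed = adaptive_margin_loss(sample, classes, target=0, scale=0.0)
```

In this toy setup, a fixed margin (`scale=0.0`) already treats the sample as well separated, while the adaptive margin still penalizes the similar class 2, which is the behavior the abstract attributes to a margin that helps distinguish similar events.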