8 months ago

Multimodal Representation

Audio Recognition

Haoxing Chen Yaohui Li Yan Hong Zizheng Huang Zhuoer Xu Zhangxuan Gu Jun Lan Huijia Zhu Weiqiang Wang

Abstract

Audio-visual zero-shot learning aims to recognize unseen classes based on paired audio-visual sequences. Recent methods mainly focus on learning multi-modal features aligned with class names to enhance the generalization ability to unseen categories. However, these approaches ignore the obscure event concepts in class names and may inevitably introduce complex network structures with difficult training objectives. In this paper, we introduce a straightforward yet efficient framework called KnowleDge-Augmented audio-visual learning (KDA), which helps the model learn novel event content more effectively by leveraging an external knowledge base. Specifically, we first propose to utilize the knowledge contained in large language models (LLMs) to generate numerous descriptive sentences that capture the important distinguishing audio-visual features of event classes, which helps the model better understand unseen categories. Furthermore, we propose a knowledge-aware adaptive margin loss to help distinguish similar events, further improving the generalization ability towards unseen classes. Extensive experimental results demonstrate that our proposed KDA can outperform state-of-the-art methods on three popular audio-visual zero-shot learning datasets. Our code will be available at https://github.com/chenhaoxing/KDA.
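The abstract does not spell out the form of the knowledge-aware adaptive margin loss, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes that each class is represented by a text embedding (for example, an average over embeddings of LLM-generated descriptions) and that the margin applied to a negative class grows with its description similarity to the true class, so that similar events are pushed further apart. The function name, the `scale` and `margin_weight` parameters, and the softmax-style objective are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def adaptive_margin_loss(av_emb, class_emb, labels, scale=10.0, margin_weight=0.5):
    """Hypothetical adaptive-margin cross-entropy (illustrative only).

    av_emb:    (B, D) audio-visual embeddings.
    class_emb: (C, D) class embeddings, assumed to come from class descriptions.
    labels:    (B,) integer class indices.
    """
    av = l2_normalize(av_emb)
    cls = l2_normalize(class_emb)

    # Cosine similarity between each sample and each class description.
    logits = av @ cls.T  # (B, C)

    # Class-to-class similarity; more similar classes receive a larger margin.
    class_sim = cls @ cls.T  # (C, C)
    margins = margin_weight * np.clip(class_sim[labels], 0.0, None)  # (B, C)
    rows = np.arange(len(labels))
    margins[rows, labels] = 0.0  # no margin on the true class

    # Adding the margin to negative logits makes similar negatives harder.
    z = scale * (logits + margins)
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_prob[rows, labels].mean())
```

With `margin_weight=0.0` this reduces to a plain scaled cosine-similarity cross-entropy; increasing it penalizes confusions between classes whose descriptions are close in embedding space.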

