
Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations

Jeong Hun Yeo, Minsu Kim, Chae Won Kim, Stavros Petridis, Yong Man Ro

Abstract

We explore a novel zero-shot Audio-Visual Speech Recognition (AVSR) framework, dubbed Zero-AVSR, which enables speech recognition in target languages without requiring any audio-visual speech data in those languages. Specifically, we introduce the Audio-Visual Speech Romanizer (AV-Romanizer), which learns language-agnostic speech representations by predicting Roman text. Then, by leveraging the strong multilingual modeling capabilities of Large Language Models (LLMs), we propose converting the predicted Roman text into language-specific graphemes, forming the proposed Cascaded Zero-AVSR. Taking it a step further, we explore a unified Zero-AVSR approach by directly integrating the audio-visual speech representations encoded by the AV-Romanizer into the LLM. This is achieved through finetuning the adapter and the LLM using our proposed multi-task learning scheme. To capture the wide spectrum of phonetic and linguistic diversity, we also introduce a Multilingual Audio-Visual Romanized Corpus (MARC) consisting of 2,916 hours of audio-visual speech data across 82 languages, along with transcriptions in both language-specific graphemes and Roman text. Extensive analysis and experiments confirm that the proposed Zero-AVSR framework has the potential to expand language support beyond the languages seen during the training of the AV-Romanizer.
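
The cascaded variant described in the abstract is essentially a two-stage pipeline: the AV-Romanizer maps audio-visual speech to language-agnostic Roman text, and a multilingual LLM then de-romanizes that text into the target language's graphemes. The sketch below is a minimal illustration of that flow under assumed interfaces; the class, function names, and prompt format are hypothetical stand-ins, not the authors' released code.

```python
# Minimal sketch of the Cascaded Zero-AVSR flow described in the abstract.
# All names here (AVRomanizer, romanize, deromanize_with_llm) are hypothetical
# stand-ins for illustration, not the authors' implementation.

from dataclasses import dataclass


@dataclass
class AVRomanizer:
    """Stage 1 stand-in: audio-visual speech -> language-agnostic Roman text."""

    def romanize(self, audio, video) -> str:
        # A real model would encode the audio and lip-video streams and decode
        # Roman text; a fixed example keeps this sketch runnable.
        return "annyeonghaseyo"


def deromanize_with_llm(roman_text: str, target_language: str) -> str:
    """Stage 2 stand-in: convert Roman text into language-specific graphemes.

    In practice this step would call a multilingual LLM; the prompt below only
    illustrates the cascaded conversion idea.
    """
    prompt = (
        f"Convert the following romanized {target_language} transcript "
        f"into {target_language} script: {roman_text}"
    )
    # An actual LLM generation call would go here; returning the prompt keeps
    # the sketch self-contained.
    return prompt


if __name__ == "__main__":
    romanizer = AVRomanizer()
    roman = romanizer.romanize(audio=None, video=None)   # stage 1: Roman text
    graphemes = deromanize_with_llm(roman, "Korean")      # stage 2: graphemes
    print(roman)
    print(graphemes)
```

The unified Zero-AVSR variant, by contrast, skips the explicit Roman-text decode and feeds the AV-Romanizer's encoded speech representations through an adapter directly into the LLM, with the adapter and LLM finetuned under the paper's multi-task learning scheme.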

