Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations

We explore a novel zero-shot Audio-Visual Speech Recognition (AVSR) framework, dubbed Zero-AVSR, which enables speech recognition in target languages without requiring any audio-visual speech data in those languages. Specifically, we introduce the Audio-Visual Speech Romanizer (AV-Romanizer), which learns language-agnostic speech representations by predicting Roman text. Then, by leveraging the strong multilingual modeling capabilities of Large Language Models (LLMs), we propose converting the predicted Roman text into language-specific graphemes, forming the proposed Cascaded Zero-AVSR. Taking it a step further, we explore a unified Zero-AVSR approach by directly integrating the audio-visual speech representations encoded by the AV-Romanizer into the LLM. This is achieved through finetuning the adapter and the LLM using our proposed multi-task learning scheme. To capture the wide spectrum of phonetic and linguistic diversity, we also introduce a Multilingual Audio-Visual Romanized Corpus (MARC) consisting of 2,916 hours of audio-visual speech data across 82 languages, along with transcriptions in both language-specific graphemes and Roman text. Extensive analysis and experiments confirm that the proposed Zero-AVSR framework has the potential to expand language support beyond the languages seen during the training of the AV-Romanizer.