MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens

Audio-Visual Speech Recognition (AVSR) achieves robust speech recognition in noisy environments by combining auditory and visual information. However, recent Large Language Model (LLM) based AVSR systems incur high computational costs due to the high temporal resolution of the audio-visual speech processed by LLMs. In this work, we introduce an efficient multimodal speech LLM framework that minimizes token length while preserving essential linguistic content. Our approach employs an early AV-fusion module for streamlined feature integration, an audio-visual speech Q-Former that dynamically allocates tokens based on input duration, and a refined query allocation strategy with a speech rate predictor that adjusts token allocation according to the speaking speed of each audio sample. Extensive experiments on the LRS3 dataset show that our method achieves state-of-the-art performance with a WER of 0.72% while using only 3.5 tokens per second. Moreover, our approach not only reduces token usage by 86% compared to the previous multimodal speech LLM framework, but also improves computational efficiency, reducing FLOPs by 35.7%.
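The core idea of duration- and rate-aware token allocation can be illustrated with a minimal sketch. The code below is a hypothetical PyTorch example, not the authors' implementation: class and parameter names (RateAwareQFormer, query_pool, base_tokens_per_sec) are assumptions, and the single cross-attention layer stands in for a full Q-Former. It shows how the number of learnable queries could scale with utterance duration and a predicted speech-rate factor before cross-attending over fused audio-visual features.

```python
# Hypothetical sketch of duration- and rate-aware query allocation for a
# Q-Former-style compressor; illustrative only, not the paper's exact model.
import torch
import torch.nn as nn


class RateAwareQFormer(nn.Module):
    """Compresses fused AV features into a small number of speech tokens.

    The number of queries grows linearly with input duration
    (base_tokens_per_sec) and is scaled by a predicted speech-rate factor,
    so faster speech receives proportionally more tokens.
    """

    def __init__(self, dim=512, max_queries=256, base_tokens_per_sec=3.5):
        super().__init__()
        self.base_tokens_per_sec = base_tokens_per_sec
        # Pool of learnable queries; only the first N are used per utterance.
        self.query_pool = nn.Parameter(torch.randn(max_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Simple stand-in for a speech rate predictor (outputs a positive factor).
        self.rate_predictor = nn.Sequential(
            nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Softplus()
        )

    def forward(self, av_feats, duration_sec):
        # av_feats: (1, T, dim) fused audio-visual features for one utterance.
        rate = self.rate_predictor(av_feats.mean(dim=1))  # (1, 1); ~1.0 = average rate
        n_queries = int(torch.clamp(
            torch.round(self.base_tokens_per_sec * duration_sec * rate),
            min=1, max=self.query_pool.size(0),
        ).item())
        queries = self.query_pool[:n_queries].unsqueeze(0)   # (1, N, dim)
        tokens, _ = self.cross_attn(queries, av_feats, av_feats)
        return tokens  # (1, N, dim) compressed speech tokens passed to the LLM


if __name__ == "__main__":
    model = RateAwareQFormer()
    feats = torch.randn(1, 150, 512)             # e.g. 6 s of fused features at 25 fps
    print(model(feats, duration_sec=6.0).shape)  # ~3.5 tokens/s, scaled by predicted rate
```

Under this sketch, a 6-second utterance at the 3.5 tokens-per-second budget yields on the order of 21 tokens for the LLM, versus the hundreds of frame-level features produced by the encoders, which is where the reported token and FLOP savings come from.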