Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing

In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements. For example, homophenes, words that share identical lip movements but produce different sounds, can be distinguished by considering the context. In this paper, we propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM), to maximize the context modeling ability by leveraging the power of LLMs. Specifically, VSP-LLM is designed to perform the multi-tasks of visual speech recognition and translation, where the given instructions control the type of task. The input video is mapped to the input latent space of an LLM by employing a self-supervised visual speech model. Based on the observation that input frames contain redundant information, we propose a novel deduplication method that reduces the embedded visual features by employing visual speech units. Through the proposed deduplication and Low Rank Adaptation (LoRA), VSP-LLM can be trained in a computationally efficient manner. On the MuAViC translation benchmark, we demonstrate that VSP-LLM trained on just 30 hours of labeled data translates lip movements more effectively than a recent model trained with 433 hours of data.
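
The sketch below illustrates the deduplication idea described above: each frame's visual embedding is assigned a discrete visual speech unit, and runs of consecutive frames that map to the same unit are merged before being passed to the LLM. The function name, the averaging of features within a run, and the example unit IDs are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of unit-based deduplication, assuming per-frame features and
# per-frame discrete unit IDs (e.g., k-means cluster indices) are available.
import torch

def deduplicate_by_units(features: torch.Tensor, units: torch.Tensor) -> torch.Tensor:
    """Merge consecutive frames that share the same visual speech unit.

    features: (T, D) per-frame visual embeddings from a self-supervised model.
    units:    (T,)  integer unit IDs per frame.
    Returns a (T', D) tensor with T' <= T, one embedding per run of identical units.
    """
    assert features.shape[0] == units.shape[0]
    merged = []
    start = 0
    for t in range(1, units.shape[0] + 1):
        # Close the current run when the unit changes or the sequence ends.
        if t == units.shape[0] or units[t] != units[start]:
            merged.append(features[start:t].mean(dim=0))
            start = t
    return torch.stack(merged)

# Example: 6 frames with units [5, 5, 5, 2, 2, 7] collapse to 3 merged embeddings,
# shortening the sequence the LLM must process.
feats = torch.randn(6, 1024)
unit_ids = torch.tensor([5, 5, 5, 2, 2, 7])
reduced = deduplicate_by_units(feats, unit_ids)
print(reduced.shape)  # torch.Size([3, 1024])
```

In this sketch, shortening the visual sequence is what makes feeding video features to an LLM tractable; combined with LoRA, which updates only low-rank adapter weights rather than the full LLM, it accounts for the computational efficiency claimed in the abstract.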