
Visual Speech Recognition for Multiple Languages in the Wild

Ma, Pingchuan; Petridis, Stavros; Pantic, Maja
Abstract

Visual speech recognition (VSR) aims to recognize the content of speech based on lip movements, without relying on the audio stream. Advances in deep learning and the availability of large audio-visual datasets have led to the development of much more accurate and robust VSR models than ever before. However, these advances are usually due to the larger training sets rather than the model design. Here we demonstrate that designing better models is equally as important as using larger training sets. We propose the addition of prediction-based auxiliary tasks to a VSR model, and highlight the importance of hyperparameter optimization and appropriate data augmentations. We show that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin. It even outperforms models that were trained on non-publicly available datasets containing up to 21 times more data. We show, furthermore, that using additional training data, even in other languages or with automatically generated transcriptions, results in further improvement.
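
The abstract mentions adding prediction-based auxiliary tasks to the VSR model. Below is a minimal, hedged sketch of what such a setup can look like in PyTorch: a shared visual encoder feeds both a main recognition head and an auxiliary head whose loss is added as a weighted extra term. The module names, dimensions, loss choices, and the use of audio-like features as the auxiliary target are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class VSRWithAuxiliaryHead(nn.Module):
    """Sketch of a VSR model with a prediction-based auxiliary head (assumed design)."""
    def __init__(self, feat_dim=512, vocab_size=1000, aux_dim=80, aux_weight=0.1):
        super().__init__()
        # Shared visual encoder (stand-in for a real lip-reading front-end).
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        # Main head: per-frame posteriors over the output vocabulary.
        self.recognition_head = nn.Linear(feat_dim, vocab_size)
        # Auxiliary head: predicts a frame-level target (here, assumed audio-like features).
        self.aux_head = nn.Linear(feat_dim, aux_dim)
        self.aux_weight = aux_weight

    def forward(self, visual_feats):
        h = self.encoder(visual_feats)        # (batch, time, feat_dim)
        return self.recognition_head(h), self.aux_head(h)

# Toy usage with fabricated tensors, only to show how the two losses combine.
model = VSRWithAuxiliaryHead()
visual_feats = torch.randn(2, 50, 512)
logits, aux_pred = model(visual_feats)

targets = torch.randint(0, 1000, (2, 50))     # dummy frame-level labels
aux_targets = torch.randn(2, 50, 80)          # dummy auxiliary regression targets

main_loss = nn.functional.cross_entropy(logits.transpose(1, 2), targets)
aux_loss = nn.functional.mse_loss(aux_pred, aux_targets)
loss = main_loss + model.aux_weight * aux_loss  # auxiliary task as a weighted extra term
```

The key design idea the abstract points to is that the auxiliary prediction objective shapes the shared encoder during training without changing the inference path, which still uses only the recognition head.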
