8 months ago

Abstract

Lip reading, also known as visual speech recognition, aims to recognize thespeech content from videos by analyzing the lip dynamics. There have beenseveral appealing progress in recent years, benefiting much from the rapidlydeveloped deep learning techniques and the recent large-scale lip-readingdatasets. Most existing methods obtained high performance by constructing acomplex neural network, together with several customized training strategieswhich were always given in a very brief description or even shown only in thesource code. We find that making proper use of these strategies could alwaysbring exciting improvements without changing much of the model. Considering thenon-negligible effects of these strategies and the existing tough status totrain an effective lip reading model, we perform a comprehensive quantitativestudy and comparative analysis, for the first time, to show the effects ofseveral different choices for lip reading. By only introducing some easy-to-getrefinements to the baseline pipeline, we obtain an obvious improvement of theperformance from 83.7% to 88.4% and from 38.2% to 55.7% on two largest publicavailable lip reading datasets, LRW and LRW-1000, respectively. They arecomparable and even surpass the existing state-of-the-art results.

Source PDF