
LCANet: End-to-End Lipreading with Cascaded Attention-CTC

Nick Cassimatis, Xiaolong Wang, Kai Xu, Dawei Li
Abstract

Machine lipreading is a special type of automatic speech recognition (ASR) which transcribes human speech by visually interpreting the movement of related face regions including lips, face, and tongue. Recently, deep neural network based lipreading methods have shown great potential and have exceeded the accuracy of experienced human lipreaders on some benchmark datasets. However, lipreading is still far from being solved, and existing methods tend to have high error rates on wild data. In this paper, we propose LCANet, an end-to-end deep neural network based lipreading system. LCANet encodes input video frames using a stacked 3D convolutional neural network (CNN), highway network and bidirectional GRU network. The encoder effectively captures both short-term and long-term spatio-temporal information. More importantly, LCANet incorporates a cascaded attention-CTC decoder to generate output texts. By cascading CTC with attention, it partially eliminates the defect of the conditional independence assumption of CTC within the hidden neural layers, and this yields a notable performance improvement as well as faster convergence. The experimental results show the proposed system achieves a 1.3% CER and 3.0% WER on the GRID corpus database, leading to a 12.3% improvement compared to the state-of-the-art methods.
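
The CER and WER figures quoted in the abstract are character-level and word-level error rates. A common way to compute such rates is to divide the Levenshtein edit distance between the predicted and reference transcripts by the reference length. The sketch below illustrates that general definition only; it is not the authors' evaluation code. The function names `edit_distance` and `error_rate` are hypothetical, the sample sentence is merely written in the style of a GRID command, and counting spaces as characters for CER is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit="word"):
    """Total edit distance divided by total reference length.

    unit="word" gives a WER-style rate, unit="char" a CER-style rate.
    Assumption: for unit="char", spaces are counted as characters.
    """
    total_edits = 0
    total_length = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        total_edits += edit_distance(ref_units, hyp_units)
        total_length += len(ref_units)
    return total_edits / total_length


if __name__ == "__main__":
    # Illustrative transcripts only, not data from the paper.
    refs = ["bin blue at f two now"]
    hyps = ["bin blue at s two now"]
    print("WER:", error_rate(refs, hyps, unit="word"))
    print("CER:", error_rate(refs, hyps, unit="char"))
```

Because both rates are normalized by the reference length, a single wrong letter in a short command sentence moves WER much more than CER, which is consistent with the reported WER being higher than the reported CER.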
