
Conformers are All You Need for Visual Speech Recognition

Chang, Oscar ; Liao, Hank ; Serdyuk, Dmitriy ; Shah, Ankit ; Siohan, Olivier
Abstract

Visual speech recognition models extract visual features in a hierarchical manner. At the lower level, there is a visual front-end with a limited temporal receptive field that processes the raw pixels depicting the lips or faces. At the higher level, there is an encoder that attends to the embeddings produced by the front-end over a large temporal receptive field. Previous work has focused on improving the visual front-end of the model to extract more useful features for speech recognition. Surprisingly, our work shows that complex visual front-ends are not necessary. Instead of allocating resources to a sophisticated visual front-end, we find that a linear visual front-end paired with a larger Conformer encoder results in lower latency, more efficient memory usage, and improved WER performance. We achieve a new state-of-the-art of 12.8% WER for visual speech recognition on the TED LRS3 dataset, which rivals the performance of audio-only models from just four years ago.
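
For readers unfamiliar with the metric behind the 12.8% figure, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and the model's hypothesis (substitutions plus deletions plus insertions), divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `word_error_rate` and whitespace tokenization are assumptions made for illustration, and real evaluation pipelines usually normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = word-level edit distance / number of reference words.

    Illustrative helper (hypothetical name); tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition, a WER of 12.8% corresponds to roughly one word error for every eight reference words. Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.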