Conformers are All You Need for Visual Speech Recognition
Oscar Chang Hank Liao Dmitriy Serdyuk Ankit Shah† Olivier Siohan
Abstract
Visual speech recognition models extract visual features in a hierarchical manner. At the lower level, there is a visual front-end with a limited temporal receptive field that processes the raw pixels depicting the lips or faces. At the higher level, there is an encoder that attends to the embeddings produced by the front-end over a large temporal receptive field. Previous work has focused on improving the visual front-end of the model to extract more useful features for speech recognition. Surprisingly, our work shows that complex visual front-ends are not necessary. Instead of allocating resources to a sophisticated visual front-end, we find that a linear visual front-end paired with a larger Conformer encoder results in lower latency, more efficient memory usage, and improved WER performance. We achieve a new state-of-the-art of 12.8% WER for visual speech recognition on the TED LRS3 dataset, which rivals the performance of audio-only models from just four years ago.
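To make the architectural claim concrete, the following is a minimal sketch of the proposed layout: flattened lip/face pixels passed through a single linear projection (the linear visual front-end), followed by a stack of Conformer blocks as the encoder. The paper's abstract does not specify implementation details, so the model dimension, number of heads, kernel size, depth, and the simplified Conformer block (no relative positional encoding or batch norm in the conv module) are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn


class LinearFrontEnd(nn.Module):
    """Linear visual front-end: flattened frame pixels -> single linear projection."""
    def __init__(self, frame_dim: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(frame_dim, d_model)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, frame_dim) flattened lip/face crops
        return self.proj(frames)


class ConformerBlock(nn.Module):
    """Simplified Conformer block: half-step FFN -> self-attention -> conv module -> half-step FFN."""
    def __init__(self, d_model: int, n_heads: int = 4, kernel_size: int = 31):
        super().__init__()
        self.ff1 = self._ffn(d_model)
        self.norm_attn = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_conv = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, 1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.pointwise2 = nn.Conv1d(d_model, d_model, 1)
        self.ff2 = self._ffn(d_model)
        self.norm_out = nn.LayerNorm(d_model)

    @staticmethod
    def _ffn(d_model: int) -> nn.Sequential:
        return nn.Sequential(nn.LayerNorm(d_model),
                             nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                             nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + 0.5 * self.ff1(x)                        # half-step feed-forward
        a = self.norm_attn(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        c = self.norm_conv(x).transpose(1, 2)            # (batch, d_model, time)
        c = nn.functional.glu(self.pointwise1(c), dim=1)
        c = self.pointwise2(nn.functional.silu(self.depthwise(c)))
        x = x + c.transpose(1, 2)
        x = x + 0.5 * self.ff2(x)
        return self.norm_out(x)


# Toy usage: 16 frames of 64x64 grayscale lip crops -> encoder features.
frontend = LinearFrontEnd(frame_dim=64 * 64, d_model=256)
encoder = nn.Sequential(*[ConformerBlock(256) for _ in range(4)])
frames = torch.randn(2, 16, 64 * 64)
features = encoder(frontend(frames))  # (2, 16, 256)
```

The design point the abstract makes is that the hierarchical feature extraction normally done by a heavy convolutional front-end can instead be absorbed by a larger Conformer encoder; in this sketch that simply means `LinearFrontEnd` does nothing beyond a projection, while capacity lives in the stacked `ConformerBlock` layers.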