
Sub-word Level Lip Reading With Visual Attention

Prajwal, K R; Afouras, Triantafyllos; Zisserman, Andrew
Abstract

The goal of this paper is to learn strong lip reading models that can recognise speech in silent videos. Most prior works deal with the open-set visual speech recognition problem by adapting existing automatic speech recognition techniques on top of trivially pooled visual features. Instead, in this paper we focus on the unique challenges encountered in lip reading and propose tailored solutions. To this end, we make the following contributions: (1) we propose an attention-based pooling mechanism to aggregate visual speech representations; (2) we use sub-word units for lip reading for the first time and show that this allows us to better model the ambiguities of the task; (3) we propose a model for Visual Speech Detection (VSD), trained on top of the lip reading network. Following the above, we obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets, and even surpass models trained on large-scale industrial datasets by using an order of magnitude less data. Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models, significantly reducing the performance gap between lip reading and automatic speech recognition. Moreover, on the AVA-ActiveSpeaker benchmark, our VSD model surpasses all visual-only baselines and even outperforms several recent audio-visual methods.
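To illustrate the general idea behind contribution (1), the sketch below shows learned attention-based pooling over a sequence of per-frame visual features, in contrast to trivial average or max pooling. This is a minimal illustration, not the authors' exact architecture; the class name `AttentionPooling`, the feature dimension, and the input shapes are assumptions made for the example.

```python
# Minimal sketch of attention-based pooling over visual speech features.
# Assumes a visual front-end has already produced features of shape
# (batch, time, feature_dim); names and dimensions here are illustrative.
import torch
import torch.nn as nn


class AttentionPooling(nn.Module):
    """Aggregate a feature sequence with learned attention weights
    instead of simple average/max pooling."""

    def __init__(self, feature_dim: int):
        super().__init__()
        # One scalar score per time step; softmax over time gives pooling weights.
        self.score = nn.Linear(feature_dim, 1)

    def forward(self, features: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        # features: (batch, time, feature_dim); mask: (batch, time) of bools (optional)
        scores = self.score(features).squeeze(-1)              # (batch, time)
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))  # ignore padded frames
        weights = torch.softmax(scores, dim=-1)                # attention over time
        # Weighted sum of features -> one pooled vector per sequence.
        return torch.einsum("bt,btd->bd", weights, features)   # (batch, feature_dim)


# Usage: pool 75 frames of 512-d features into a single utterance-level vector.
pool = AttentionPooling(feature_dim=512)
frames = torch.randn(2, 75, 512)
pooled = pool(frames)
print(pooled.shape)  # torch.Size([2, 512])
```

The design choice is that the weights are learned from the features themselves, so informative frames (e.g. clearly articulated visemes) can contribute more to the pooled representation than uninformative ones, which fixed pooling cannot do.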