Training Strategies for Improved Lip-reading

Several training strategies and temporal models have recently been proposed for isolated word lip-reading in a series of independent works. However, the potential of combining the best strategies, and the impact of each of them, has not been explored. In this paper, we systematically investigate the performance of state-of-the-art data augmentation approaches, temporal models and other training strategies, such as self-distillation and the use of word boundary indicators. Our results show that Time Masking (TM) is the most important augmentation, followed by mixup, and that Densely-Connected Temporal Convolutional Networks (DC-TCN) are the best temporal model for lip-reading of isolated words. Using self-distillation and word boundary indicators is also beneficial, but to a lesser extent. A combination of all the above methods results in a classification accuracy of 93.4%, which is an absolute improvement of 4.6% over the current state-of-the-art performance on the LRW dataset. The performance can be further improved to 94.1% by pre-training on additional datasets. An error analysis of the various training strategies reveals that the performance improves by increasing the classification accuracy of hard-to-recognise words.
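As a minimal illustration of the two augmentations highlighted above, the PyTorch sketch below implements Time Masking (zeroing a random contiguous span of video frames) and mixup (convex combination of clip pairs and their labels). The tensor layout, mask length, mixup alpha and the 500-class label space (the LRW vocabulary size) are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def time_mask(batch, max_mask_len=10):
    """Time Masking (TM): zero out a random contiguous span of frames.

    batch: (B, T, H, W) video clips. max_mask_len is an assumed
    hyper-parameter, not necessarily the one used in the paper.
    """
    B, T = batch.shape[0], batch.shape[1]
    out = batch.clone()
    for i in range(B):
        span = torch.randint(0, max_mask_len + 1, (1,)).item()
        if span == 0:
            continue  # no masking for this clip
        start = torch.randint(0, T - span + 1, (1,)).item()
        out[i, start:start + span] = 0.0  # blank the selected frames
    return out

def mixup(batch, targets, alpha=0.4, num_classes=500):
    """mixup: blend random pairs of clips and their one-hot labels.

    alpha parameterises the Beta distribution from which the mixing
    coefficient is drawn; 0.4 is an assumed value for illustration.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(batch.size(0))
    mixed = lam * batch + (1.0 - lam) * batch[perm]
    onehot = F.one_hot(targets, num_classes).float()
    mixed_targets = lam * onehot + (1.0 - lam) * onehot[perm]
    return mixed, mixed_targets
```

In practice both augmentations would be applied on the fly during training, with the soft mixup targets used in a cross-entropy loss against the model's logits.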