A Deep Neural Framework for Continuous Sign Language Recognition by Iterative Training

This work develops a continuous sign language (SL) recognition framework with deep neural networks, which directly transcribes videos of SL sentences into sequences of ordered gloss labels. Previous methods for continuous SL recognition usually employ hidden Markov models, which have limited capacity to capture temporal information. In contrast, our proposed architecture adopts deep convolutional neural networks with stacked temporal fusion layers as the feature extraction module, and bidirectional recurrent neural networks as the sequence learning module. We propose an iterative optimization process for our architecture to fully exploit the representation capability of deep neural networks with limited data. We first train the end-to-end recognition model to generate alignment proposals, and then use these proposals as strong supervisory information to directly tune the feature extraction module. This training process can be run iteratively to further improve recognition performance. We additionally contribute by exploring the multimodal fusion of RGB images and optical flow in sign language. Our method is evaluated on two challenging SL recognition benchmarks and outperforms the state of the art by a relative improvement of more than 15% on both databases.
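The iterative optimization described above can be sketched as a simple control loop. The sketch below is illustrative only: every function and data structure is a hypothetical stand-in (not the authors' code or API), and the real system would use a CNN feature extractor, a bidirectional RNN sequence model, and a sentence-to-frame alignment decoder in place of these stubs.

```python
# Illustrative sketch of the iterative training scheme (all stubs).
# Stage 1: train the full recognizer end to end on sentence-level
#          gloss labels and decode frame-level alignment proposals.
# Stage 2: use those proposals as frame-level supervision to tune
#          the feature extraction module directly. Repeat.

def train_end_to_end(model, data):
    # Hypothetical: fit CNN + BiRNN on (video, gloss sequence) pairs.
    model["recognizer_trained"] = True
    return model

def propose_alignments(model, data):
    # Hypothetical: decode the recognizer to assign a gloss label
    # to each video frame, yielding an alignment proposal.
    return [{"frames": v["frames"], "gloss_per_frame": v["gloss"]}
            for v in data]

def finetune_feature_extractor(model, alignments):
    # Hypothetical: tune the convolutional module with the
    # frame-level labels from the alignment proposals.
    model["finetune_rounds"] = model.get("finetune_rounds", 0) + 1
    return model

def iterative_training(data, num_iters=3):
    model = {}
    for _ in range(num_iters):
        model = train_end_to_end(model, data)
        alignments = propose_alignments(model, data)
        model = finetune_feature_extractor(model, alignments)
    return model
```

The key design point the abstract emphasizes is that the end-to-end stage requires only sentence-level gloss labels, while the alignment proposals it produces give the feature extractor the stronger frame-level supervision it needs to train well on limited data.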