Jointly Learning Visual and Auditory Speech Representations from Raw Data

We present RAVEn, a self-supervised multi-modal approach to jointly learn visual and auditory speech representations. Our pre-training objective involves encoding masked inputs, and then predicting contextualised targets generated by slowly-evolving momentum encoders. Driven by the inherent differences between video and audio, our design is asymmetric w.r.t. the two modalities' pretext tasks: Whereas the auditory stream predicts both the visual and auditory targets, the visual one predicts only the auditory targets. We observe strong results in low- and high-resource labelled data settings when fine-tuning the visual and auditory encoders resulting from a single pre-training stage, in which the encoders are jointly trained. Notably, RAVEn surpasses all self-supervised methods on visual speech recognition (VSR) on LRS3, and combining RAVEn with self-training using only 30 hours of labelled data even outperforms a recent semi-supervised method trained on 90,000 hours of non-public data. At the same time, we achieve state-of-the-art results in the LRS3 low-resource setting for auditory speech recognition (as well as for VSR). Our findings point to the viability of learning powerful speech representations entirely from raw video and audio, i.e., without relying on handcrafted features. Code and models are available at https://github.com/ahaliassos/raven.
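
To make the asymmetric pretext tasks and momentum targets concrete, the following is a minimal sketch of the training objective described above. It assumes small MLP encoders/predictors, random-feature inputs, and a cosine-regression loss purely for illustration; the actual RAVEn architecture, masking scheme, and loss follow the paper and the released code at the repository above.

```python
# Sketch of RAVEn-style asymmetric cross-modal prediction with momentum
# (EMA) target encoders. Architectures and loss here are simplified
# assumptions, not the paper's exact design.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


def mlp(in_dim, dim=256):
    return nn.Sequential(nn.Linear(in_dim, dim), nn.GELU(), nn.Linear(dim, dim))


class RAVEnSketch(nn.Module):
    def __init__(self, video_dim=512, audio_dim=512, dim=256, momentum=0.999):
        super().__init__()
        self.momentum = momentum
        # Student encoders operate on masked inputs.
        self.video_enc = mlp(video_dim, dim)
        self.audio_enc = mlp(audio_dim, dim)
        # Momentum (teacher) encoders produce contextualised targets.
        self.video_tgt = copy.deepcopy(self.video_enc)
        self.audio_tgt = copy.deepcopy(self.audio_enc)
        for p in list(self.video_tgt.parameters()) + list(self.audio_tgt.parameters()):
            p.requires_grad = False
        # Asymmetric predictors: the audio stream predicts both modalities'
        # targets, while the video stream predicts only the audio targets.
        self.a2a_pred = mlp(dim, dim)
        self.a2v_pred = mlp(dim, dim)
        self.v2a_pred = mlp(dim, dim)

    @torch.no_grad()
    def update_teachers(self):
        # Slowly-evolving EMA update of the momentum (target) encoders.
        for student, teacher in [(self.video_enc, self.video_tgt),
                                 (self.audio_enc, self.audio_tgt)]:
            for ps, pt in zip(student.parameters(), teacher.parameters()):
                pt.data.mul_(self.momentum).add_(ps.data, alpha=1 - self.momentum)

    def forward(self, video_masked, audio_masked, video, audio):
        # Encode masked inputs with the student encoders.
        v = self.video_enc(video_masked)
        a = self.audio_enc(audio_masked)
        with torch.no_grad():
            # Targets come from the momentum encoders on unmasked inputs.
            v_tgt = self.video_tgt(video)
            a_tgt = self.audio_tgt(audio)

        def cos_loss(pred, tgt):
            return 2 - 2 * F.cosine_similarity(pred, tgt, dim=-1).mean()

        # Audio predicts auditory and visual targets; video predicts only
        # auditory targets.
        return (cos_loss(self.a2a_pred(a), a_tgt)
                + cos_loss(self.a2v_pred(a), v_tgt)
                + cos_loss(self.v2a_pred(v), a_tgt))


# Toy usage with random features standing in for raw video/audio frames.
model = RAVEnSketch()
video, audio = torch.randn(4, 512), torch.randn(4, 512)
video_masked = video * (torch.rand_like(video) > 0.5)
audio_masked = audio * (torch.rand_like(audio) > 0.5)
loss = model(video_masked, audio_masked, video, audio)
loss.backward()
model.update_teachers()
```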