MARLIN: Masked Autoencoder for facial video Representation LearnINg

This paper proposes a self-supervised approach to learn universal facial representations from videos that can transfer across a variety of facial analysis tasks such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder that learns highly robust and generic facial embeddings from abundantly available, non-annotated, web-crawled facial videos. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from densely masked facial regions, which mainly include the eyes, nose, mouth, lips, and skin, to capture local and global aspects that in turn help encode generic and transferable features. Through a variety of experiments on diverse downstream tasks, we demonstrate MARLIN to be an excellent facial video encoder and feature extractor that performs consistently well across a variety of downstream tasks including FAR (1.13% gain over the supervised benchmark), FER (2.64% gain over the unsupervised benchmark), DFD (1.86% gain over the unsupervised benchmark), LS (29.36% gain in Fréchet Inception Distance), and even in the low-data regime. Our code and models are available at https://github.com/ControlNet/MARLIN .
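
For intuition, the masked-reconstruction objective described above can be sketched in a few lines of PyTorch. The sketch below is illustrative only: the module sizes, the plain random masking (MARLIN instead biases masking toward facial regions such as the eyes, nose, mouth, lips, and skin), and the simple MSE reconstruction loss are all simplifying assumptions, not the architecture or training recipe released in the repository.

```python
import torch
import torch.nn as nn


class MaskedVideoAutoencoder(nn.Module):
    """Toy masked autoencoder over flattened spatio-temporal patch tokens.

    Hypothetical, simplified stand-in for the MARLIN objective:
    encode only the visible tokens, then reconstruct the masked ones.
    """

    def __init__(self, patch_dim=768, embed_dim=384, num_patches=1568, mask_ratio=0.9):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, embed_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=6, batch_first=True), num_layers=4)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=6, batch_first=True), num_layers=2)
        self.head = nn.Linear(embed_dim, patch_dim)  # predicts raw pixels of masked patches

    def forward(self, patches):
        # patches: (B, N, patch_dim) spatio-temporal cubes cut from a face-cropped clip
        B, N, _ = patches.shape
        num_keep = int(N * (1 - self.mask_ratio))

        # Random masking for simplicity; MARLIN guides the mask toward facial regions.
        noise = torch.rand(B, N, device=patches.device)
        ids_shuffle = noise.argsort(dim=1)
        ids_keep, ids_mask = ids_shuffle[:, :num_keep], ids_shuffle[:, num_keep:]

        tokens = self.patch_embed(patches) + self.pos_embed
        visible = torch.gather(
            tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        encoded = self.encoder(visible)

        # Append learnable mask tokens and decode to reconstruct the hidden patches.
        mask_tokens = self.mask_token.expand(B, ids_mask.size(1), -1)
        decoded = self.decoder(torch.cat([encoded, mask_tokens], dim=1))
        pred = self.head(decoded[:, num_keep:])  # predictions at masked positions only

        target = torch.gather(
            patches, 1, ids_mask.unsqueeze(-1).expand(-1, -1, patches.size(-1)))
        return nn.functional.mse_loss(pred, target)


if __name__ == "__main__":
    model = MaskedVideoAutoencoder()
    clip_patches = torch.randn(2, 1568, 768)  # dummy patch tokens from two face clips
    loss = model(clip_patches)
    loss.backward()
```

After pretraining with an objective of this kind, the decoder is discarded and the encoder serves as the generic facial video feature extractor evaluated on the downstream FAR, FER, DFD, and LS tasks.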