ExpNet: Landmark-Free, Deep, 3D Facial Expressions

We describe a deep learning based method for estimating 3D facial expression coefficients. Unlike previous work, our process does not rely on facial landmark detection methods as a proxy step. Recent methods have shown that a CNN can be trained to regress accurate and discriminative 3D morphable model (3DMM) representations directly from image intensities. By foregoing facial landmark detection, these methods were able to estimate shapes for occluded faces appearing in unprecedented in-the-wild viewing conditions. We build on those methods by showing that facial expressions can also be estimated by a robust, deep, landmark-free approach. Our ExpNet CNN is applied directly to the intensities of a face image and regresses a 29D vector of 3D expression coefficients. We propose a unique method for collecting data to train this network, leveraging the robustness of deep networks to training label noise. We further offer a novel means of evaluating the accuracy of estimated expression coefficients: by measuring how well they capture facial emotions on the CK+ and EmotiW-17 emotion recognition benchmarks. We show that our ExpNet produces expression coefficients which better discriminate between facial emotions than those obtained using state-of-the-art facial landmark detection techniques. Moreover, this advantage grows as image scales drop, demonstrating that our ExpNet is more robust to scale changes than landmark detection methods. Finally, at the same level of accuracy, our ExpNet is orders of magnitude faster than its alternatives.
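To make the role of the 29D expression coefficients concrete, the sketch below shows how a standard linear 3DMM combines a mean shape with identity and expression bases; the expression coefficients regressed by a network such as ExpNet weight the expression basis. This is a minimal toy illustration with made-up dimensions (a 5-vertex mesh, a 99D identity basis) and random placeholder data, not the paper's actual model or bases.

```python
import numpy as np

# Hypothetical toy dimensions: N mesh vertices, 99 identity and 29 expression components.
N = 5
rng = np.random.default_rng(0)

s_mean = rng.normal(size=3 * N)       # mean face shape (flattened x, y, z per vertex)
A_id = rng.normal(size=(3 * N, 99))   # identity (shape) basis
A_exp = rng.normal(size=(3 * N, 29))  # expression basis

alpha = rng.normal(size=99)  # identity coefficients (e.g., from a shape-regression CNN)
eta = rng.normal(size=29)    # 29D expression coefficients (what ExpNet regresses)

# Linear 3DMM: reconstructed shape = mean + identity deformation + expression deformation.
S = s_mean + A_id @ alpha + A_exp @ eta
print(S.shape)  # (15,)
```

In this formulation, estimating an expression amounts to recovering the 29 weights `eta` from the image, with no intermediate landmark detection step.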