Expressive Body Capture: 3D Hands, Face, and Body from a Single Image

To facilitate the analysis of human actions, interactions, and emotions, we compute a 3D model of human body pose, hand pose, and facial expression from a single monocular image. To achieve this, we use thousands of 3D scans to train a new, unified, 3D model of the human body, SMPL-X, that extends SMPL with fully articulated hands and an expressive face. Learning to regress the parameters of SMPL-X directly from images is challenging without paired images and 3D ground truth. Consequently, we follow the approach of SMPLify, which estimates 2D features and then optimizes model parameters to fit the features. We improve on SMPLify in several significant ways: (1) we detect 2D features corresponding to the face, hands, and feet and fit the full SMPL-X model to these; (2) we train a new neural network pose prior using a large MoCap dataset; (3) we define a new interpenetration penalty that is both fast and accurate; (4) we automatically detect gender and the appropriate body model (male, female, or neutral); (5) our PyTorch implementation achieves a speedup of more than 8x over Chumpy. We use the new method, SMPLify-X, to fit SMPL-X to both controlled images and images in the wild. We evaluate 3D accuracy on a new curated dataset comprising 100 images with pseudo ground truth. This is a step towards automatic expressive human capture from monocular RGB data. The models, code, and data are available for research purposes at https://smpl-x.is.tue.mpg.de.
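To make the optimize-to-fit idea concrete, the following is a minimal PyTorch sketch of the general SMPLify-style fitting loop, not the authors' released SMPLify-X code: a differentiable body model maps pose parameters to joints, joints are projected to 2D, and a robust reprojection loss plus a pose prior is minimized. The `toy_body_model`, template joints, camera focal length, prior weight, and the random "detected" keypoints are all illustrative placeholders.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins: the real pipeline uses the SMPL-X layer as the
# differentiable body model and a learned (VAE) pose prior.
rest_joints = torch.randn(24, 3)  # fixed template joint locations (toy)

def toy_body_model(pose):
    """Toy body model: offsets template joints by the pose parameters.

    SMPL-X instead maps pose, shape, and expression parameters to a posed
    mesh with body, hand, and face joints.
    """
    return rest_joints + pose

def project(joints_3d, focal=1000.0):
    """Pinhole projection of 3D joints onto the image plane."""
    z = (joints_3d[:, 2:3] + 5.0).clamp(min=1e-3)  # place body in front of camera
    return focal * joints_3d[:, :2] / z

def gmof(residual, sigma=100.0):
    """Geman-McClure robust error, as used for reprojection terms in SMPLify."""
    sq = residual ** 2
    return (sigma ** 2) * sq / (sigma ** 2 + sq)

# Detected 2D keypoints and per-joint confidences (random stand-ins here).
keypoints_2d = torch.randn(24, 2) * 50.0
conf = torch.rand(24, 1)

pose = torch.zeros(24, 3, requires_grad=True)
optimizer = torch.optim.LBFGS([pose], max_iter=50, line_search_fn="strong_wolfe")

def closure():
    optimizer.zero_grad()
    joints_2d = project(toy_body_model(pose))
    reproj = (conf * gmof(joints_2d - keypoints_2d)).sum()
    prior = (pose ** 2).sum()  # crude Gaussian stand-in for the learned prior
    loss = reproj + 1e-2 * prior
    loss.backward()
    return loss

optimizer.step(closure)
print(f"final fitting loss: {closure().item():.2f}")
```

The actual SMPLify-X objective additionally optimizes shape, expression, and camera parameters, and includes the learned neural-network pose prior and the interpenetration penalty described above; running the optimization as native PyTorch tensor operations is what enables the reported speedup over Chumpy.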