Synthetic Training for Accurate 3D Human Pose and Shape Estimation in the Wild

This paper addresses the problem of monocular 3D human shape and pose estimation from an RGB image. Despite great progress in this field in terms of pose prediction accuracy, state-of-the-art methods often predict inaccurate body shapes. We suggest that this is primarily due to the scarcity of in-the-wild training data with diverse and accurate body shape labels. Thus, we propose STRAPS (Synthetic Training for Real Accurate Pose and Shape), a system that utilises proxy representations, such as silhouettes and 2D joints, as inputs to a shape and pose regression neural network, which is trained with synthetic training data (generated on-the-fly during training using the SMPL statistical body model) to overcome data scarcity. We bridge the gap between synthetic training inputs and noisy real inputs, which are predicted by keypoint detection and segmentation CNNs at test-time, by using data augmentation and corruption during training. In order to evaluate our approach, we curate and provide a challenging evaluation dataset for monocular human shape estimation, Sports Shape and Pose 3D (SSP-3D). It consists of RGB images of tightly-clothed sports-persons with a variety of body shapes and corresponding pseudo-ground-truth SMPL shape and pose parameters, obtained via multi-frame optimisation. We show that STRAPS outperforms other state-of-the-art methods on SSP-3D in terms of shape prediction accuracy, while remaining competitive with the state-of-the-art on pose-centric datasets and metrics.
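
To illustrate the pipeline described above, the following is a minimal sketch (not the authors' code) of on-the-fly synthetic training-input generation with augmentation and corruption. The SMPL forward pass, silhouette rendering and 2D joint projection are replaced by hypothetical placeholders, and all parameter names and noise settings are illustrative assumptions rather than the paper's actual configuration.

```python
import numpy as np

NUM_SHAPE_PARAMS = 10   # SMPL shape (beta) dimensionality
NUM_JOINTS = 17         # assumed number of 2D keypoints
IMG_SIZE = 256          # assumed proxy-input resolution

def sample_smpl_shape(rng):
    """Sample random SMPL shape parameters to cover diverse body shapes."""
    # Pose sampling is omitted; in the paper, poses are also sampled on-the-fly.
    return rng.normal(0.0, 1.25, size=NUM_SHAPE_PARAMS)

def corrupt_inputs(silhouette, joints_2d, rng,
                   joint_noise_std=8.0, joint_drop_prob=0.1, occlude_prob=0.3):
    """Corrupt clean synthetic inputs to mimic noisy detector/segmenter outputs."""
    # Perturb 2D joint locations with Gaussian noise.
    joints_2d = joints_2d + rng.normal(0.0, joint_noise_std, size=joints_2d.shape)
    # Randomly drop joints, simulating missed keypoint detections.
    drop = rng.random(NUM_JOINTS) < joint_drop_prob
    joints_2d[drop] = 0.0
    # Randomly occlude a rectangular patch of the silhouette,
    # simulating segmentation failures.
    if rng.random() < occlude_prob:
        h, w = rng.integers(20, 80, size=2)
        y, x = rng.integers(0, IMG_SIZE - 80, size=2)
        silhouette[y:y + h, x:x + w] = 0.0
    return silhouette, joints_2d

def synthetic_batch(batch_size, rng):
    """Generate one training batch of (proxy inputs, SMPL shape targets) on the fly."""
    batch = []
    for _ in range(batch_size):
        betas = sample_smpl_shape(rng)
        # Placeholder clean inputs: in practice these would come from posing the
        # SMPL mesh, rendering its silhouette and projecting its joints to 2D.
        silhouette = np.ones((IMG_SIZE, IMG_SIZE), dtype=np.float32)
        joints_2d = rng.uniform(0, IMG_SIZE, size=(NUM_JOINTS, 2))
        silhouette, joints_2d = corrupt_inputs(silhouette, joints_2d, rng)
        batch.append((silhouette, joints_2d, betas))
    return batch

rng = np.random.default_rng(0)
batch = synthetic_batch(batch_size=4, rng=rng)  # fed to the regression network
```

Because the corrupted proxy inputs are regenerated every iteration, the regressor never sees the same clean synthetic example twice, which is the mechanism the abstract relies on to bridge the synthetic-to-real input gap.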