Self-Attentive 3D Human Pose and Shape Estimation from Videos

We consider the task of estimating 3D human pose and shape from videos. While existing frame-based approaches have made significant progress, these methods are independently applied to each image, thereby often leading to inconsistent predictions. In this work, we present a video-based learning algorithm for 3D human pose and shape estimation. The key insights of our method are two-fold. First, to address the inconsistent temporal prediction issue, we exploit temporal information in videos and propose a self-attention module that jointly considers short-range and long-range dependencies across frames, resulting in temporally coherent estimations. Second, we model human motion with a forecasting module that allows the transition between adjacent frames to be smooth. We evaluate our method on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets. Extensive experimental results show that our algorithm performs favorably against the state-of-the-art methods.
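
To illustrate how self-attention lets every frame draw on both nearby and distant frames, the following is a minimal sketch of generic scaled dot-product self-attention over a sequence of per-frame feature vectors. It is not the module proposed in this work: the function names, the toy 2-D features, and the use of raw features as queries, keys, and values (with no learned projections) are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_self_attention(frames):
    """Scaled dot-product self-attention over per-frame feature vectors.

    Assumption: queries, keys, and values are the raw features themselves
    (no learned projections), which keeps the sketch dependency-free.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    attended = []
    for query in frames:
        # Score the current frame against every frame, near or far in time.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in frames]
        weights = softmax(scores)
        # Each output is a convex combination of all frame features.
        attended.append([
            sum(w * frame[d] for w, frame in zip(weights, frames))
            for d in range(dim)
        ])
    return attended

# Hypothetical 2-D features for four frames; the third differs from the rest.
features = [[1.0, 0.0], [1.0, 0.1], [0.2, 0.9], [1.0, 0.0]]
smoothed = temporal_self_attention(features)
```

Because the attention weights for each frame are non-negative and sum to one, every output vector is a weighted average of the whole sequence, which is the sense in which attention can pull an inconsistent per-frame estimate toward its temporal context. A learned module would typically add projection matrices and multiple heads on top of this.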