Capturing the motion of every joint: 3D human pose and shape estimation with independent tokens

Yang, Sen ; Heng, Wen ; Liu, Gang ; Luo, Guozhong ; Yang, Wankou ; Yu, Gang

Abstract

In this paper we present a novel method to estimate 3D human pose and shape from monocular videos. This task requires directly recovering pixel-aligned 3D human pose and body shape from monocular images or videos, which is challenging due to its inherent ambiguity. To improve precision, existing methods rely heavily on the initialized mean pose and shape as prior estimates and on parameter regression in an iterative error-feedback manner. In addition, video-based approaches model the overall change over the image-level features to temporally enhance the single-frame feature, but fail to capture the rotational motion at the joint level and cannot guarantee local temporal consistency. To address these issues, we propose a novel Transformer-based model with a design of independent tokens. First, we introduce three types of tokens independent of the image feature: joint rotation tokens, a shape token, and a camera token. By progressively interacting with image features through Transformer layers, these tokens learn to encode the prior knowledge of human 3D joint rotations, body shape, and position information from large-scale data, and are updated to estimate SMPL parameters conditioned on a given image. Second, benefiting from the proposed token-based representation, we further use a temporal model to focus on capturing the rotational temporal information of each joint, which is empirically conducive to preventing large jitters in local parts. Despite being conceptually simple, the proposed method attains superior performance on the 3DPW and Human3.6M datasets. Using ResNet-50 and Transformer architectures, it obtains 42.0 mm error on the PA-MPJPE metric of the challenging 3DPW, outperforming state-of-the-art counterparts by a large margin. Code will be publicly available at https://github.com/yangsenius/INT_HMR_Model
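
The independent-token idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes 24 joint rotation tokens (the usual SMPL joint count, which the abstract does not state), a hypothetical token width of 8, random values standing in for learned parameters and for ResNet-50 features, and a single-head cross-attention step without learned projections in place of full Transformer layers. The names `make_tokens`, `cross_attend`, and `split_tokens` are illustrative only.

```python
import math
import random

NUM_JOINTS = 24  # assumption: SMPL convention (23 body joints + global orientation)
DIM = 8          # hypothetical toy token width


def make_tokens(num, dim, rng):
    """Create `num` vectors of width `dim` (stand-ins for learned tokens or features)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(tokens, feats):
    """One simplified attention step: each token queries the image features
    and adds the attended context to itself (residual update)."""
    scale = 1.0 / math.sqrt(len(tokens[0]))
    updated = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in feats]
        weights = softmax(scores)
        context = [sum(w * f[d] for w, f in zip(weights, feats))
                   for d in range(len(q))]
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


def split_tokens(tokens):
    """Separate the token sequence back into its three roles."""
    return {
        "joint_rotation": tokens[:NUM_JOINTS],
        "shape": tokens[NUM_JOINTS],
        "camera": tokens[NUM_JOINTS + 1],
    }


rng = random.Random(0)
# Tokens exist independently of any image; in a real model they would be learned.
tokens = make_tokens(NUM_JOINTS + 2, DIM, rng)
# Stand-in for a flattened 7x7 backbone feature map (hypothetical size).
image_feats = make_tokens(49, DIM, rng)

# "Progressive" interaction: repeat the update over several layers.
for _ in range(3):
    tokens = cross_attend(tokens, image_feats)

out = split_tokens(tokens)
print(len(out["joint_rotation"]), len(out["shape"]), len(out["camera"]))
```

In a full model, each updated token would presumably be passed to a small regression head to produce the corresponding SMPL rotation, shape, or camera parameters. Because every joint keeps its own token, a sequence of per-frame tokens for one joint could then be given to a temporal model, which is the second step the abstract describes.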