CLIFF: Carrying Location Information in Full Frames into Human Pose and Shape Estimation

Top-down methods dominate the field of 3D human pose and shape estimation, because they are decoupled from human detection and allow researchers to focus on the core problem. However, cropping, their first step, discards the location information from the very beginning, which makes them unable to accurately predict the global rotation in the original camera coordinate system. To address this problem, we propose to Carry Location Information in Full Frames (CLIFF) into this task. Specifically, we feed more holistic features to CLIFF by concatenating the cropped-image feature with its bounding-box information. We calculate the 2D reprojection loss with a broader view of the full frame, taking a projection process similar to that of the person projected in the image. Fed and supervised by global-location-aware information, CLIFF directly predicts the global rotation along with more accurate articulated poses. Besides, we propose a pseudo-ground-truth annotator based on CLIFF, which provides high-quality 3D annotations for in-the-wild 2D datasets and offers crucial full supervision for regression-based methods. Extensive experiments on popular benchmarks show that CLIFF outperforms prior arts by a significant margin, and ranks first on the AGORA leaderboard (the SMPL-Algorithms track). The code and data are available at https://github.com/huawei-noah/noah-research/tree/master/CLIFF.
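The two core ideas in the abstract can be sketched in a few lines: (1) augmenting the cropped-image feature with a vector describing where the crop sits in the full frame, and (2) projecting predicted 3D joints with the full-frame camera rather than a crop-local one. The helper names, the normalization by an assumed focal length, and all numeric values below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def bbox_info(center, scale, full_size, focal=5000.0):
    # Hypothetical encoding of bounding-box location: offset of the crop
    # center from the full-frame center, plus the crop scale, normalized
    # by an assumed focal length. This vector would be concatenated with
    # the cropped-image feature before the regression head.
    cx, cy = center
    w, h = full_size
    return np.array([cx - w / 2.0, cy - h / 2.0, scale]) / focal

def full_frame_projection(joints3d, cam_t, focal, full_center):
    # Perspective projection of 3D joints (N, 3) into pixel coordinates
    # of the ORIGINAL full frame, so the 2D reprojection loss sees the
    # same projection process as the person imaged by the real camera,
    # instead of a crop-local weak-perspective camera.
    pts = joints3d + cam_t                      # move into camera frame
    x = focal * pts[:, 0] / pts[:, 2] + full_center[0]
    y = focal * pts[:, 1] / pts[:, 2] + full_center[1]
    return np.stack([x, y], axis=1)

# Example: concatenate a dummy image feature with the bbox-info vector.
feat = np.random.randn(2048)                    # e.g. a backbone feature
info = bbox_info(center=(320.0, 240.0), scale=200.0, full_size=(640, 480))
holistic = np.concatenate([feat, info])         # fed to the regressor
```

The sketch deliberately uses plain numpy; in practice both steps would run inside the network's forward pass and the projected joints would enter a 2D keypoint loss over full-frame annotations.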