HDNet: Human Depth Estimation for Multi-Person Camera-Space Localization

Current works on multi-person 3D pose estimation mainly focus on the estimation of the 3D joint locations relative to the root joint and ignore the absolute locations of each pose. In this paper, we propose the Human Depth Estimation Network (HDNet), an end-to-end framework for absolute root joint localization in the camera coordinate space. Our HDNet first estimates the 2D human pose with heatmaps of the joints. These estimated heatmaps serve as attention masks for pooling features from image regions corresponding to the target person. A skeleton-based Graph Neural Network (GNN) is utilized to propagate features among joints. We formulate the target depth regression as a bin index estimation problem, in which the bin index is obtained by applying a soft-argmax operation to the classification output of our HDNet. We evaluate our HDNet on the root joint localization and root-relative 3D pose estimation tasks with two benchmark datasets, i.e., Human3.6M and MuPoTS-3D. The experimental results show that we outperform the previous state-of-the-art consistently under multiple evaluation metrics. Our source code is available at: https://github.com/jiahaoLjh/HumanDepth.
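
To make two of the steps above concrete, the following is a minimal sketch, not the released implementation, of heatmap-weighted feature pooling and of soft-argmax over depth-bin scores. The function names (`attention_pool`, `soft_argmax`, `bin_index_to_depth`), the nested-list data layout, and the uniform linear bin spacing are all assumptions made for illustration; the abstract does not specify how bins are spaced or how features are stored.

```python
import math


def attention_pool(feature_map, heatmap):
    """Pool an H x W x C feature map into one C-vector for a joint.

    The joint heatmap (H x W, non-negative) is normalized to sum to 1 and
    used as attention weights. Layout and name are illustrative assumptions.
    """
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    total = sum(sum(row) for row in heatmap)
    if total <= 0:
        raise ValueError("heatmap must have positive mass")
    pooled = [0.0] * channels
    for y in range(height):
        for x in range(width):
            weight = heatmap[y][x] / total
            for c in range(channels):
                pooled[c] += weight * feature_map[y][x][c]
    return pooled


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def soft_argmax(logits):
    """Expected bin index under the softmax distribution.

    Unlike a hard argmax, this is a continuous (differentiable) function
    of the classification scores.
    """
    probs = softmax(logits)
    return sum(index * p for index, p in enumerate(probs))


def bin_index_to_depth(index, d_min, d_max, num_bins):
    """Map a (possibly fractional) bin index to a depth value.

    ASSUMPTION: bins are uniform in depth between d_min and d_max and an
    integer index refers to the bin center. The actual spacing used by
    HDNet is not stated in the abstract.
    """
    bin_width = (d_max - d_min) / num_bins
    return d_min + (index + 0.5) * bin_width
```

As a quick check of the behavior, equal scores over four bins give `soft_argmax([0.0, 0.0, 0.0, 0.0]) == 1.5`, the midpoint of the index range, while a sharply peaked score vector yields an index close to the position of the peak. The fractional index is what allows a classification-style output to be trained against a continuous depth target.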