Putting People in their Place: Monocular Regression of 3D People in Depth

Given an image with multiple people, our goal is to directly regress the pose and shape of all the people as well as their relative depth. Inferring the depth of a person in an image, however, is fundamentally ambiguous without knowing their height. This is particularly problematic when the scene contains people of very different sizes, e.g., from infants to adults. Solving this requires three things. First, we develop a novel method to infer the poses and depth of multiple people in a single image. While previous work that estimates multiple people does so by reasoning in the image plane, our method, called BEV, adds an imaginary Bird's-Eye-View representation to explicitly reason about depth. BEV reasons simultaneously about body centers in the image and in depth and, by combining these, estimates 3D body position. Unlike prior work, BEV is a single-shot method that is end-to-end differentiable. Second, height varies with age, making it impossible to resolve depth without also estimating the age of people in the image. To do so, we exploit a 3D body model space that lets BEV infer shapes from infants to adults. Third, to train BEV, we need a new dataset. Specifically, we create a "Relative Human" (RH) dataset that includes age labels and relative depth relationships between the people in the images. Extensive experiments on RH and AGORA demonstrate the effectiveness of the model and training scheme. BEV outperforms existing methods on depth reasoning, child shape estimation, and robustness to occlusion. The code and dataset are released for research purposes.
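
To make the height-depth ambiguity concrete, the following minimal sketch assumes an idealized pinhole camera and an upright person parallel to the image plane. The function names, the focal length, and the body heights are illustrative assumptions and are not taken from BEV or its released code.

```python
def projected_height_px(height_m: float, depth_m: float, focal_px: float) -> float:
    """Pinhole projection: pixel height of an upright person at a given depth."""
    return focal_px * height_m / depth_m


def depth_from_pixels(pixel_height: float, assumed_height_m: float, focal_px: float) -> float:
    """Invert the projection: depth implied by a pixel height and an assumed body height."""
    return focal_px * assumed_height_m / pixel_height


# Hypothetical values chosen only for illustration.
focal_px = 1000.0
adult_px = projected_height_px(height_m=1.75, depth_m=7.0, focal_px=focal_px)
infant_px = projected_height_px(height_m=0.75, depth_m=3.0, focal_px=focal_px)

# The same pixel height maps to different depths depending on the assumed height.
depth_if_adult = depth_from_pixels(adult_px, assumed_height_m=1.75, focal_px=focal_px)
depth_if_infant = depth_from_pixels(adult_px, assumed_height_m=0.75, focal_px=focal_px)

print(adult_px, infant_px)
print(depth_if_adult, depth_if_infant)
```

Under these assumptions, an adult at 7 m and an infant at 3 m span the same number of pixels, so depth cannot be recovered from image size alone without an estimate of body height, which in turn depends on age.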