Dual networks based 3D Multi-Person Pose Estimation from Monocular Video

Monocular 3D human pose estimation has made progress in recent years. Most methods focus on a single person and estimate the pose in person-centric coordinates, i.e., coordinates relative to the center of the target person. Hence, these methods are inapplicable to multi-person 3D pose estimation, where absolute coordinates (e.g., camera coordinates) are required. Moreover, multi-person pose estimation is more challenging than single-person pose estimation due to inter-person occlusion and close human interactions. Existing top-down methods rely on human detection and thus suffer from detection errors, so they cannot produce reliable pose estimates in multi-person scenes. Meanwhile, existing bottom-up methods do not use human detection and are therefore unaffected by detection errors, but because they process all persons in a scene at once, they are prone to errors, particularly for persons at small scales. To address all these challenges, we propose integrating the top-down and bottom-up approaches to exploit their respective strengths. Our top-down network estimates the joints of all persons in an image patch rather than only one, making it robust to possibly erroneous bounding boxes. Our bottom-up network incorporates human-detection-based normalized heatmaps, making it more robust to scale variations. Finally, the estimated 3D poses from the top-down and bottom-up networks are fed into our integration network to produce the final 3D poses. To address the common gap between training and testing data, we perform optimization at test time, refining the estimated 3D human poses using a high-order temporal constraint, a re-projection loss, and bone-length regularization. Our evaluations demonstrate the effectiveness of the proposed method. Code and models are available at https://github.com/3dpose/3D-Multi-Person-Pose.
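To give a sense of what detection-based scale normalization can look like, the sketch below shows one plausible way to use detected bounding boxes to normalize person scale before heatmap estimation. This is a minimal PyTorch illustration under our own assumptions, not the paper's actual pipeline; heatmap_net, out_size, and the crop-and-resize strategy are hypothetical placeholders.

    import torch
    import torch.nn.functional as F

    def person_normalized_heatmaps(image, boxes, heatmap_net, out_size=256):
        # image:       (3, H, W) input frame.
        # boxes:       list of (x1, y1, x2, y2) detected person boxes,
        #              as integer pixel coordinates.
        # heatmap_net: any module mapping a (1, 3, S, S) crop to joint heatmaps.
        heatmaps = []
        for x1, y1, x2, y2 in boxes:
            crop = image[:, y1:y2, x1:x2].unsqueeze(0)
            # Resize every person to a canonical resolution so the network
            # sees people at roughly the same scale, large or small.
            crop = F.interpolate(crop, size=(out_size, out_size),
                                 mode='bilinear', align_corners=False)
            heatmaps.append(heatmap_net(crop))
        return heatmaps

The test-time refinement can likewise be made concrete. The following sketch, again under our own assumptions, refines a (T, J, 3) single-person trajectory in camera coordinates: it approximates the high-order temporal constraint with a second-order finite-difference (acceleration) penalty, uses a simple pinhole model for the re-projection loss, and keeps bone lengths near reference values. The loss weights, iteration count, and bone list are illustrative choices, not taken from the source.

    import torch

    def refine_poses(poses_3d, joints_2d, K, bones, ref_lengths,
                     iters=100, lr=1e-2, w_proj=1.0, w_temp=0.1, w_bone=0.01):
        # poses_3d:    (T, J, 3) initial 3D joints in camera coordinates.
        # joints_2d:   (T, J, 2) detected 2D keypoints in pixels.
        # K:           (3, 3) camera intrinsic matrix.
        # bones:       list of (parent, child) joint-index pairs.
        # ref_lengths: (num_bones,) reference bone lengths.
        x = poses_3d.clone().detach().requires_grad_(True)
        opt = torch.optim.Adam([x], lr=lr)
        parent = torch.tensor([b[0] for b in bones])
        child = torch.tensor([b[1] for b in bones])
        for _ in range(iters):
            opt.zero_grad()
            # Re-projection loss: pinhole projection of the 3D joints
            # should match the detected 2D keypoints.
            proj = x @ K.T
            uv = proj[..., :2] / proj[..., 2:3].clamp(min=1e-6)
            loss_proj = (uv - joints_2d).pow(2).mean()
            # High-order temporal constraint, approximated here by penalizing
            # joint acceleration (second-order finite difference over time).
            accel = x[2:] - 2.0 * x[1:-1] + x[:-2]
            loss_temp = accel.pow(2).mean()
            # Bone-length regularization: limb lengths should stay close
            # to the reference lengths across all frames.
            lengths = (x[:, parent] - x[:, child]).norm(dim=-1)
            loss_bone = (lengths - ref_lengths).pow(2).mean()
            loss = w_proj * loss_proj + w_temp * loss_temp + w_bone * loss_bone
            loss.backward()
            opt.step()
        return x.detach()

In the paper's setting, such a refinement would be applied per tracked person to the integration network's output; the exact constraint order and weighting are design choices we have not taken from the source.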