Estimating Egocentric 3D Human Pose in the Wild with External Weak Supervision

Egocentric 3D human pose estimation with a single fisheye camera has drawn a significant amount of attention recently. However, existing methods struggle with pose estimation from in-the-wild images, because they can only be trained on synthetic data due to the unavailability of large-scale in-the-wild egocentric datasets. Furthermore, these methods easily fail when the body parts are occluded by or interacting with the surrounding scene. To address the shortage of in-the-wild data, we collect a large-scale in-the-wild egocentric dataset called Egocentric Poses in the Wild (EgoPW). This dataset is captured by a head-mounted fisheye camera and an auxiliary external camera, which provides an additional observation of the human body from a third-person perspective during training. We present a new egocentric pose estimation method, which can be trained on the new dataset with weak external supervision. Specifically, we first generate pseudo labels for the EgoPW dataset with a spatio-temporal optimization method by incorporating the external-view supervision. The pseudo labels are then used to train an egocentric pose estimation network. To facilitate the network training, we propose a novel learning strategy to supervise the egocentric features with the high-quality features extracted by a pretrained external-view pose estimation model. The experiments show that our method predicts accurate 3D poses from a single in-the-wild egocentric image and outperforms the state-of-the-art methods both quantitatively and qualitatively.
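The feature-supervision strategy described above resembles knowledge distillation: intermediate features of the egocentric (student) network are pushed toward features produced by a frozen, pretrained external-view (teacher) pose model. A minimal sketch of such a loss is shown below; the class name, projection layer, and feature dimensions are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class FeatureSupervisionLoss(nn.Module):
    """Hypothetical feature-level supervision loss (distillation-style).

    The egocentric network's features are projected into the teacher's
    feature space and matched against the external-view model's features,
    which act as fixed targets.
    """

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Project student features to the teacher's dimensionality.
        self.proj = nn.Linear(student_dim, teacher_dim)
        self.mse = nn.MSELoss()

    def forward(self, student_feat: torch.Tensor,
                teacher_feat: torch.Tensor) -> torch.Tensor:
        # detach(): no gradient flows into the frozen teacher model.
        return self.mse(self.proj(student_feat), teacher_feat.detach())

# Example usage with assumed feature sizes (batch of 8).
loss_fn = FeatureSupervisionLoss(student_dim=256, teacher_dim=512)
student = torch.randn(8, 256, requires_grad=True)
teacher = torch.randn(8, 512)
loss = loss_fn(student, teacher)
loss.backward()  # gradients reach only the student side and projection
```

In a full training pipeline, this term would be combined with the pose losses computed against the spatio-temporally optimized pseudo labels; the exact weighting and feature layers used are choices specific to the paper.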