Attention-Propagation Network for Egocentric Heatmap to 3D Pose Lifting

We present EgoTAP, a heatmap-to-3D pose lifting method for highly accurate stereo egocentric 3D pose estimation. Severe self-occlusion and out-of-view limbs in egocentric camera views make accurate pose estimation a challenging problem. To address the challenge, prior methods employ joint heatmaps, probabilistic 2D representations of the body pose, but the heatmap-to-3D pose conversion still remains an inaccurate process. We propose a novel heatmap-to-3D lifting method composed of the Grid ViT Encoder and the Propagation Network. The Grid ViT Encoder summarizes joint heatmaps into an effective feature embedding using self-attention. The Propagation Network then estimates the 3D pose by utilizing skeletal information to better estimate the position of obscured joints. Our method significantly outperforms the previous state-of-the-art both qualitatively and quantitatively, as demonstrated by a 23.9\% reduction of error in the MPJPE metric. Our source code is available on GitHub.
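
For intuition, the sketch below mirrors the two-stage pipeline described above in PyTorch: per-joint heatmaps are patch-embedded and fused with self-attention (a Grid-ViT-style encoder), and the resulting joint features are propagated along a kinematic tree to regress 3D coordinates. All module internals, dimensions, and the 15-joint skeleton are illustrative assumptions, not the released EgoTAP implementation.

```python
# Hypothetical sketch of the heatmap-to-3D lifting pipeline described in the
# abstract (not the authors' released code). Shapes and topology are assumed.
import torch
import torch.nn as nn

class GridViTEncoder(nn.Module):
    """Encodes per-joint 2D heatmaps into feature embeddings with self-attention."""
    def __init__(self, num_joints=15, heatmap_size=64, patch=16, dim=128):
        super().__init__()
        # Split each heatmap into a grid of patches and embed them (assumed design).
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        num_patches = (heatmap_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, num_joints * num_patches, dim))
        self.attn = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, heatmaps):                                   # (B, J, H, W)
        B, J, H, W = heatmaps.shape
        x = self.patch_embed(heatmaps.reshape(B * J, 1, H, W))     # (B*J, dim, h, w)
        x = x.flatten(2).transpose(1, 2)                           # (B*J, P, dim)
        x = x.reshape(B, -1, x.shape[-1]) + self.pos               # (B, J*P, dim)
        x = self.attn(x)
        # Pool patch tokens back into one embedding per joint.
        return x.reshape(B, J, -1, x.shape[-1]).mean(dim=2)        # (B, J, dim)

class PropagationNetwork(nn.Module):
    """Regresses 3D joints, propagating features from parent to child joints."""
    def __init__(self, parents, dim=128):
        super().__init__()
        self.parents = parents                 # parent index per joint (-1 = root)
        self.update = nn.Linear(2 * dim, dim)  # fuse joint feature with parent state
        self.head = nn.Linear(dim, 3)          # per-joint 3D coordinate

    def forward(self, feats):                  # (B, J, dim)
        B, J, D = feats.shape
        states = [None] * J
        for j in range(J):                     # parents precede children in order
            p = self.parents[j]
            parent_state = states[p] if p >= 0 else torch.zeros(B, D, device=feats.device)
            states[j] = torch.relu(self.update(torch.cat([feats[:, j], parent_state], dim=-1)))
        return self.head(torch.stack(states, dim=1))               # (B, J, 3)

# Toy 15-joint kinematic tree rooted at joint 0 (assumed topology).
parents = [-1, 0, 1, 2, 3, 1, 5, 6, 1, 8, 9, 0, 11, 0, 13]
encoder, lifter = GridViTEncoder(), PropagationNetwork(parents)
heatmaps = torch.rand(2, 15, 64, 64)           # stereo heatmaps could be fused per joint
pose3d = lifter(encoder(heatmaps))             # (2, 15, 3)
```

The parent-to-child loop is only one plausible way to "utilize skeletal information"; the actual Propagation Network may differ in how features flow along the skeleton.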