3D Human Pose Perception from Egocentric Stereo Videos

While head-mounted devices are becoming more compact, they provide egocentric views with significant self-occlusions of the device user. Hence, existing methods often fail to accurately estimate complex 3D poses from egocentric views. In this work, we propose a new transformer-based framework to improve egocentric stereo 3D human pose estimation, which leverages the scene information and temporal context of egocentric stereo videos. Specifically, we utilize 1) depth features from our 3D scene reconstruction module with uniformly sampled windows of egocentric stereo frames, and 2) human joint queries enhanced by temporal features of the video inputs. Our method is able to accurately estimate human poses even in challenging scenarios, such as crouching and sitting. Furthermore, we introduce two new benchmark datasets, i.e., UnrealEgo2 and UnrealEgo-RW (Real World). The proposed datasets offer a much larger number of egocentric stereo views with a wider variety of human motions than the existing datasets, allowing comprehensive evaluation of existing and upcoming methods. Our extensive experiments show that the proposed approach significantly outperforms previous methods. We will release UnrealEgo2, UnrealEgo-RW, and trained models on our project page.
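To make the high-level architecture concrete, below is a minimal, hypothetical PyTorch sketch of how learnable per-joint queries could cross-attend to scene-depth and temporal features in a transformer decoder. The module names, feature dimensions, and fusion scheme are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class EgoStereoPoseTransformer(nn.Module):
    """Illustrative sketch: learnable per-joint queries attend to fused
    scene-depth and temporal tokens to regress 3D joint positions.
    All sizes (num_joints, feat_dim, etc.) are hypothetical placeholders."""

    def __init__(self, num_joints=16, feat_dim=256, num_layers=4, num_heads=8):
        super().__init__()
        # One learnable query per human joint.
        self.joint_queries = nn.Parameter(torch.randn(num_joints, feat_dim))
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        # Regress a 3D location from each refined joint query.
        self.head = nn.Linear(feat_dim, 3)

    def forward(self, depth_feats, temporal_feats):
        # depth_feats:    (B, N_d, feat_dim) tokens from a scene-depth branch
        # temporal_feats: (B, N_t, feat_dim) tokens summarizing past stereo frames
        memory = torch.cat([depth_feats, temporal_feats], dim=1)
        batch_size = memory.size(0)
        queries = self.joint_queries.unsqueeze(0).expand(batch_size, -1, -1)
        refined = self.decoder(queries, memory)  # (B, num_joints, feat_dim)
        return self.head(refined)                # (B, num_joints, 3) 3D pose

# Example with random tensors standing in for encoder outputs.
model = EgoStereoPoseTransformer()
pose = model(torch.randn(2, 64, 256), torch.randn(2, 32, 256))
print(pose.shape)  # torch.Size([2, 16, 3])
```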