Towards Robust and Smooth 3D Multi-Person Pose Estimation from Monocular Videos in the Wild

3D pose estimation is a valuable task in computer vision with various practical applications. In particular, 3D multi-person pose estimation from a monocular video (3DMPPE) is especially challenging and remains largely uncharted, far from ready for in-the-wild scenarios. We identify three unresolved issues with existing methods: lack of robustness to views unseen during training, vulnerability to occlusion, and severe jittering in the output. As a remedy, we propose POTR-3D, the first realization of a sequence-to-sequence 2D-to-3D lifting model for 3DMPPE, powered by a novel geometry-aware data augmentation strategy capable of generating unbounded data with a variety of views while accounting for the ground plane and occlusions. Through extensive experiments, we verify that the proposed model and data augmentation generalize robustly to diverse unseen views, recover poses reliably under heavy occlusions, and produce more natural and smoother outputs. The effectiveness of our approach is demonstrated not only by state-of-the-art performance on public benchmarks, but also by qualitative results on more challenging in-the-wild videos. Demo videos are available at https://www.youtube.com/@potr3d.