Permutation-Invariant Relational Network for Multi-person 3D Pose Estimation

The recovery of multi-person 3D poses from a single RGB image is a severely ill-conditioned problem due to the inherent 2D-3D depth ambiguity, inter-person occlusions, and body truncations. To tackle these issues, recent works have shown promising results by simultaneously reasoning for different people. However, in most cases this is done by only considering pairwise person interactions, thus hindering a holistic scene representation able to capture long-range interactions. This is addressed by approaches that jointly process all people in the scene, although they require defining one of the individuals as a reference and a pre-defined person ordering, and are sensitive to this choice. In this paper, we overcome both these limitations and propose an approach for multi-person 3D pose estimation that captures long-range interactions independently of the input order. For this purpose, we build a residual-like permutation-invariant network that successfully refines potentially corrupted initial 3D poses estimated by an off-the-shelf detector. The residual function is learned via Set Transformer blocks that model the interactions among all initial poses, regardless of their ordering or number. A thorough evaluation demonstrates that our approach is able to boost the performance of the initially estimated 3D poses by large margins, achieving state-of-the-art results on standardized benchmarks. Additionally, the proposed module works in a computationally efficient manner and can be used as a drop-in complement for any 3D pose detector in multi-person scenes.
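To make the described architecture concrete, the following is a minimal PyTorch sketch of a residual, permutation-aware refinement module built from Set Transformer-style self-attention blocks. It is an illustration under assumptions only: the class names (`SetAttentionBlock`, `PoseRefiner`), the joint count, hidden dimension, and number of blocks are hypothetical choices, not the paper's actual implementation details.

```python
import torch
import torch.nn as nn


class SetAttentionBlock(nn.Module):
    """Self-attention over a set of elements (Set Transformer SAB-style).
    Permutation-equivariant: reordering the inputs reorders the outputs
    identically, so no reference person or fixed ordering is required."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        # x: (batch, num_people, dim); every person attends to every other,
        # capturing long-range interactions across the whole scene.
        h, _ = self.attn(x, x, x)
        x = self.norm1(x + h)
        x = self.norm2(x + self.ff(x))
        return x


class PoseRefiner(nn.Module):
    """Residual refinement of a variable-size set of initial 3D poses.
    Each person's pose is embedded, processed jointly with all others via
    attention blocks, and decoded into an additive 3D correction.
    num_joints=14, dim=128, num_blocks=2 are assumed values for illustration."""

    def __init__(self, num_joints=14, dim=128, num_blocks=2):
        super().__init__()
        in_dim = num_joints * 3
        self.embed = nn.Linear(in_dim, dim)
        self.blocks = nn.Sequential(*[SetAttentionBlock(dim) for _ in range(num_blocks)])
        self.decode = nn.Linear(dim, in_dim)

    def forward(self, poses):
        # poses: (batch, num_people, num_joints, 3), e.g. potentially
        # corrupted initial estimates from an off-the-shelf detector.
        b, n, j, _ = poses.shape
        x = self.embed(poses.reshape(b, n, j * 3))
        x = self.blocks(x)
        delta = self.decode(x).reshape(b, n, j, 3)
        return poses + delta  # residual correction, independent of input order


# Usage sketch: refine the poses of 5 people with 14 joints each.
refiner = PoseRefiner()
initial_poses = torch.randn(1, 5, 14, 3)
refined_poses = refiner(initial_poses)
```

Because the attention blocks operate on a set rather than a sequence, the same module handles any number of people, which is what allows it to act as a drop-in complement to an existing detector.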