Dynamic Graph Reasoning for Multi-person 3D Pose Estimation

Multi-person 3D pose estimation is a challenging task because of occlusion and depth ambiguity, especially in crowded scenes. To address these problems, most existing methods model body context cues by enhancing feature representations with graph neural networks or by adding structural constraints. However, these methods are not robust because of their single-root formulation, which decodes 3D poses from a single root node over a pre-defined graph. In this paper, we propose GR-M3D, which models \textbf{M}ulti-person \textbf{3D} pose estimation with dynamic \textbf{G}raph \textbf{R}easoning. The decoding graph in GR-M3D is predicted rather than pre-defined. In particular, it first generates several data maps and enhances them with a scale- and depth-aware refinement module (SDAR). Multiple root keypoints and dense decoding paths for each person are then estimated from these data maps. Based on them, dynamic decoding graphs are built by assigning path weights to the decoding paths, where the path weights are inferred from the enhanced data maps. We call this process dynamic graph reasoning (DGR). Finally, the 3D poses are decoded according to the dynamic decoding graph of each detected person. By adopting soft path weights that depend on the input data, GR-M3D implicitly adjusts the structure of the decoding graph, which lets the decoding graphs adapt to different input persons and makes them more capable of handling occlusion and depth ambiguity than previous methods. We empirically show that the proposed bottom-up approach outperforms even top-down methods and achieves state-of-the-art results on three 3D pose datasets.
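
To make the idea of soft path weights concrete, the following is a minimal illustrative sketch and not the GR-M3D implementation. It assumes that each root keypoint proposes a candidate position for a joint through one decoding path (a root position plus an offset), and that the path weights are obtained by a softmax over per-path scores. The function names `softmax` and `decode_joint`, the softmax normalization, and the weighted-average aggregation are all assumptions made for illustration; the abstract does not specify these details.

```python
import math


def softmax(logits):
    """Normalize raw path scores into weights that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_joint(roots, offsets, path_logits):
    """Decode one joint as a soft-weighted average over decoding paths.

    roots:       list of (x, y, z) root keypoints of one person
    offsets:     list of (dx, dy, dz), one root-to-joint path per root
    path_logits: list of unnormalized path scores, one per root
                 (hypothetical stand-in for weights inferred from data maps)
    """
    weights = softmax(path_logits)
    joint = [0.0, 0.0, 0.0]
    for root, offset, w in zip(roots, offsets, weights):
        for axis in range(3):
            # Each path proposes root + offset; blend proposals by weight.
            joint[axis] += w * (root[axis] + offset[axis])
    return tuple(joint), weights


# Two roots; the second path is given a very low score, as might happen
# if its root were occluded, so it contributes almost nothing.
joint, weights = decode_joint(
    roots=[(0.0, 0.0, 0.0), (5.0, 5.0, 5.0)],
    offsets=[(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    path_logits=[10.0, -10.0],
)
```

Under these assumptions, a path whose score is low is softly suppressed rather than removed, which is one way a decoding graph could change its effective structure per input without a hard, pre-defined topology.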