Graph and Temporal Convolutional Networks for 3D Multi-person Pose Estimation in Monocular Videos

Despite recent progress, 3D multi-person pose estimation from monocular videos remains challenging due to the commonly encountered problem of missing information caused by occlusion, partially out-of-frame target persons, and inaccurate person detection. To tackle this problem, we propose a novel framework integrating graph convolutional networks (GCNs) and temporal convolutional networks (TCNs) to robustly estimate camera-centric multi-person 3D poses without requiring camera parameters. In particular, we introduce a human-joint GCN, which, unlike existing GCNs, is based on a directed graph that employs the 2D pose estimator's confidence scores to improve the pose estimation results. We also introduce a human-bone GCN, which models the bone connections and provides information beyond the human joints. The two GCNs work together to estimate the spatial frame-wise 3D poses, making use of both the visible joint and bone information in the target frame to infer occluded or missing human-part information. To further refine the 3D pose estimation, we use TCNs to enforce temporal and human-dynamics constraints. We use a joint-TCN to estimate person-centric 3D poses across frames, and propose a velocity-TCN that estimates the velocity of 3D joints to ensure the consistency of the 3D pose estimates in consecutive frames. Finally, to estimate the 3D poses of multiple persons, we propose a root-TCN that estimates camera-centric 3D poses without requiring camera parameters. Quantitative and qualitative evaluations demonstrate the effectiveness of the proposed method.
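
To make the confidence-weighted directed-graph idea concrete, here is a minimal PyTorch sketch of one human-joint GCN layer, not the authors' released code: the skeleton edges, the 17-joint layout, and the normalization scheme are illustrative assumptions. Each sending joint's message is scaled by its 2D-detector confidence, so low-confidence (occluded or missing) joints contribute less to their neighbors.

```python
import torch
import torch.nn as nn

NUM_JOINTS = 17
# Directed edges, parent joint -> child joint (illustrative skeleton).
EDGES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6),
         (0, 7), (7, 8), (8, 9), (9, 10), (8, 11), (11, 12),
         (12, 13), (8, 14), (14, 15), (15, 16)]

def directed_adjacency(num_joints, edges):
    """Asymmetric adjacency with self-loops: A[i, j] = 1 if joint j sends to joint i."""
    a = torch.eye(num_joints)
    for parent, child in edges:
        a[child, parent] = 1.0
    return a

class ConfidenceGCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.register_buffer("adj", directed_adjacency(NUM_JOINTS, EDGES))
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, conf):
        # x:    (batch, joints, in_dim)  per-joint features (e.g., 2D coordinates)
        # conf: (batch, joints)          2D pose estimator confidence in [0, 1]
        # Scale each sender joint's contribution by its confidence score.
        weighted = self.adj.unsqueeze(0) * conf.unsqueeze(1)   # (B, J, J)
        deg = weighted.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        agg = (weighted / deg) @ x                             # normalized aggregation
        return torch.relu(self.linear(agg))

layer = ConfidenceGCNLayer(in_dim=2, out_dim=64)
x = torch.randn(8, NUM_JOINTS, 2)    # 2D joint positions
conf = torch.rand(8, NUM_JOINTS)     # detector confidences
out = layer(x, conf)                 # (8, 17, 64)
```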
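
The velocity-TCN idea can likewise be sketched as a 1D temporal convolution over finite-difference joint velocities, re-integrated to produce temporally consistent poses. The layer sizes, kernel widths, and integration step below are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class VelocityTCN(nn.Module):
    def __init__(self, num_joints=17, hidden=128):
        super().__init__()
        c = num_joints * 3  # flattened 3D coordinates per frame
        self.net = nn.Sequential(
            nn.Conv1d(c, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, c, kernel_size=3, padding=1),
        )

    def forward(self, poses):
        # poses: (batch, frames, joints, 3) person-centric 3D poses
        b, t, j, _ = poses.shape
        # Finite-difference velocities between consecutive frames.
        vel = poses[:, 1:] - poses[:, :-1]                   # (B, T-1, J, 3)
        vel = vel.reshape(b, t - 1, j * 3).transpose(1, 2)   # (B, C, T-1)
        refined_vel = self.net(vel).transpose(1, 2).reshape(b, t - 1, j, 3)
        # Re-integrate: propagate the first pose with the refined velocities,
        # enforcing consistency of 3D estimates across consecutive frames.
        refined = torch.cumsum(
            torch.cat([poses[:, :1], refined_vel], dim=1), dim=1)
        return refined                                        # (B, T, J, 3)

tcn = VelocityTCN()
poses = torch.randn(4, 9, 17, 3)   # a 9-frame window of 3D pose estimates
smoothed = tcn(poses)              # temporally consistent estimates
```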