Learning Pose Grammar to Encode Human Body Configuration for 3D Pose Estimation

In this paper, we propose a pose grammar to tackle the problem of 3D human pose estimation. Our model directly takes 2D pose as input and learns a generalized 2D-3D mapping function. The proposed model consists of a base network, which efficiently captures pose-aligned features, and a hierarchy of Bi-directional RNNs (BRNNs) on top that explicitly incorporates a set of knowledge regarding human body configuration (i.e., kinematics, symmetry, motor coordination). The proposed model thus enforces high-level constraints over human poses. In learning, we develop a pose sample simulator to augment training samples in virtual camera views, which further improves the generalizability of our model. We validate our method on public 3D human pose benchmarks and propose a new evaluation protocol in a cross-view setting to verify the generalization capability of different methods. We empirically observe that most state-of-the-art methods encounter difficulty in this setting, while our method handles such challenges well.
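To make the described architecture concrete, the following is a minimal sketch, assuming PyTorch, a 16-joint skeleton, and hypothetical joint groupings: a base network lifts the 2D input to per-joint features, and a set of bidirectional RNNs runs over kinematic chains, symmetric limb pairs, and coordinating limb groups to impose body-level structure. All dimensions, module names, and groupings here are illustrative placeholders, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

N_JOINTS = 16          # assumed joint count
FEAT = 1024            # assumed hidden width of the base network

class BaseNetwork(nn.Module):
    """Lifts flattened 2D joint coordinates to per-joint features."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(N_JOINTS * 2, FEAT), nn.ReLU(),
            nn.Linear(FEAT, FEAT), nn.ReLU(),
            nn.Linear(FEAT, N_JOINTS * 64),    # 64-d feature per joint (assumed)
        )

    def forward(self, pose_2d):                # (B, N_JOINTS, 2)
        feats = self.mlp(pose_2d.flatten(1))
        return feats.view(-1, N_JOINTS, 64)

class GrammarBRNN(nn.Module):
    """Bi-directional RNN over one ordered joint group (one grammar rule)."""
    def __init__(self, joint_ids):
        super().__init__()
        self.joint_ids = joint_ids
        self.brnn = nn.GRU(64, 64, batch_first=True, bidirectional=True)
        self.out = nn.Linear(128, 3)           # 3D coordinates per joint

    def forward(self, joint_feats):            # (B, N_JOINTS, 64)
        seq = joint_feats[:, self.joint_ids]   # select joints along this rule
        h, _ = self.brnn(seq)
        return self.joint_ids, self.out(h)     # (B, len(group), 3)

class PoseGrammarNet(nn.Module):
    def __init__(self, groups):
        super().__init__()
        self.base = BaseNetwork()
        self.rules = nn.ModuleList(GrammarBRNN(g) for g in groups)

    def forward(self, pose_2d):
        feats = self.base(pose_2d)
        pred = torch.zeros(pose_2d.size(0), N_JOINTS, 3)
        count = torch.zeros(N_JOINTS, 1)
        for rule in self.rules:                # average predictions where rules overlap
            ids, p = rule(feats)
            pred[:, ids] += p
            count[ids] += 1
        return pred / count.clamp(min=1)

# Hypothetical grammar rules: kinematic chains, symmetric pairs, coordinated limbs.
KINEMATIC = [[0, 1, 2, 3], [0, 4, 5, 6], [0, 7, 8, 9]]
SYMMETRY = [[1, 4], [2, 5], [3, 6], [10, 13], [11, 14], [12, 15]]
COORDINATION = [[3, 6, 12, 15]]

model = PoseGrammarNet(KINEMATIC + SYMMETRY + COORDINATION)
out = model(torch.randn(2, N_JOINTS, 2))       # (2, 16, 3) predicted 3D pose
```

Averaging the per-rule outputs is one simple way to fuse overlapping groups; the paper's hierarchy may combine them differently.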
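The pose sample simulator could, in spirit, resemble the sketch below: project an annotated 3D pose through randomly sampled virtual cameras to synthesize additional (2D, 3D) training pairs. This assumes NumPy, a pinhole camera model, and placeholder sampling ranges and intrinsics (f, c); it is not the authors' simulator.

```python
import numpy as np

def random_camera(radius=5.0):
    """Sample a hypothetical virtual camera orbiting the subject."""
    az = np.random.uniform(0, 2 * np.pi)           # azimuth around the vertical axis
    el = np.random.uniform(-0.2, 0.2)              # small elevation jitter
    cy, sy = np.cos(az), np.sin(az)
    cx, sx = np.cos(el), np.sin(el)
    # World-to-camera rotation: yaw followed by a small pitch
    R = np.array([[cy, 0, -sy],
                  [0, 1, 0],
                  [sy, 0, cy]]) @ np.array([[1, 0, 0],
                                            [0, cx, -sx],
                                            [0, sx, cx]])
    t = np.array([0.0, 0.0, radius])               # place the subject in front of the camera
    return R, t

def project(pose_3d, R, t, f=1000.0, c=(500.0, 500.0)):
    """Pinhole projection of an (N, 3) pose into (N, 2) image coordinates."""
    cam = pose_3d @ R.T + t
    return f * cam[:, :2] / cam[:, 2:3] + np.asarray(c)

# Usage: turn one 3D pose into several virtual-view (2D, 3D) training pairs.
pose_3d = np.random.randn(16, 3)                   # stand-in for a real mocap pose
pairs = []
for _ in range(10):
    R, t = random_camera()
    pairs.append((project(pose_3d, R, t), pose_3d @ R.T + t))
```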