Convolutional Pose Machines

Pose Machines provide a sequential prediction framework for learning rich implicit spatial models. In this work we show a systematic design for how convolutional networks can be incorporated into the pose machine framework for learning image features and image-dependent spatial models for the task of pose estimation. The contribution of this paper is to implicitly model long-range dependencies between variables in structured prediction tasks such as articulated pose estimation. We achieve this by designing a sequential architecture composed of convolutional networks that directly operate on belief maps from previous stages, producing increasingly refined estimates for part locations, without the need for explicit graphical model-style inference. Our approach addresses the characteristic difficulty of vanishing gradients during training by providing a natural learning objective function that enforces intermediate supervision, thereby replenishing back-propagated gradients and conditioning the learning procedure. We demonstrate state-of-the-art performance and outperform competing methods on standard benchmarks including the MPII, LSP, and FLIC datasets.
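
The sequential design and intermediate supervision described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: belief maps are flattened to lists of floats, the per-stage loss is assumed to be a squared error against an ideal belief map, and the names `run_stages`, `intermediate_supervision_loss`, and `toy_stage` are hypothetical. In an actual model each stage would be a convolutional network rather than the toy function used here.

```python
from typing import Callable, List, Optional, Sequence

Belief = List[float]  # a flattened belief map (assumption: one part only)
Stage = Callable[[Sequence[float], Optional[Belief]], Belief]


def run_stages(features: Sequence[float], stages: Sequence[Stage]) -> List[Belief]:
    """Apply stages in sequence; each stage sees the image features and
    the belief map produced by the previous stage (None for the first)."""
    beliefs: Optional[Belief] = None
    outputs: List[Belief] = []
    for stage in stages:
        beliefs = stage(features, beliefs)
        outputs.append(beliefs)  # keep every stage's output for supervision
    return outputs


def squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Sum of squared differences between a predicted and an ideal belief map."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def intermediate_supervision_loss(
    stage_outputs: Sequence[Sequence[float]], target: Sequence[float]
) -> float:
    """Total loss: one squared-error term per stage, summed over all stages
    (assumed form of the intermediate-supervision objective)."""
    return sum(squared_error(b, target) for b in stage_outputs)


def toy_stage(features: Sequence[float], prev: Optional[Belief]) -> Belief:
    """Hypothetical stand-in for a convolutional stage: the first stage
    halves the features, later stages average features with the previous belief."""
    if prev is None:
        return [0.5 * f for f in features]
    return [0.5 * (p + f) for p, f in zip(prev, features)]


# Three stages refining a two-cell belief map toward the target [0.0, 1.0].
outputs = run_stages([0.0, 1.0], [toy_stage] * 3)
total_loss = intermediate_supervision_loss(outputs, [0.0, 1.0])
```

Because every stage contributes its own term to the total, each stage receives a training signal from its own output rather than only through the stages that follow it, which is the sense in which intermediate supervision replenishes back-propagated gradients.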