DECA: Deep viewpoint-Equivariant human pose estimation using Capsule Autoencoders

Human Pose Estimation (HPE) aims at retrieving the 3D position of humanjoints from images or videos. We show that current 3D HPE methods suffer a lackof viewpoint equivariance, namely they tend to fail or perform poorly whendealing with viewpoints unseen at training time. Deep learning methods oftenrely on either scale-invariant, translation-invariant, or rotation-invariantoperations, such as max-pooling. However, the adoption of such procedures doesnot necessarily improve viewpoint generalization, rather leading to moredata-dependent methods. To tackle this issue, we propose a novel capsuleautoencoder network with fast Variational Bayes capsule routing, named DECA. Bymodeling each joint as a capsule entity, combined with the routing algorithm,our approach can preserve the joints' hierarchical and geometrical structure inthe feature space, independently from the viewpoint. By achieving viewpointequivariance, we drastically reduce the network data dependency at trainingtime, resulting in an improved ability to generalize for unseen viewpoints. Inthe experimental validation, we outperform other methods on depth images fromboth seen and unseen viewpoints, both top-view, and front-view. In the RGBdomain, the same network gives state-of-the-art results on the challengingviewpoint transfer task, also establishing a new framework for top-view HPE.The code can be found at https://github.com/mmlab-cv/DECA.