Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation

We propose a robust and accurate method for estimating the 3D poses of two hands in close interaction from a single color image. This is a very challenging problem, as large occlusions and many confusions between the joints may happen. State-of-the-art methods solve this problem by regressing a heatmap for each joint, which requires solving two problems simultaneously: localizing the joints and recognizing them. In this work, we propose to separate these tasks by relying on a CNN to first localize joints as 2D keypoints, and on self-attention between the CNN features at these keypoints to associate them with the corresponding hand joint. The resulting architecture, which we call "Keypoint Transformer", is highly efficient as it achieves state-of-the-art performance with roughly half the number of model parameters on the InterHand2.6M dataset. We also show it can be easily extended to estimate the 3D pose of an object manipulated by one or two hands with high performance. Moreover, we created a new dataset of more than 75,000 images of two hands manipulating an object, fully annotated in 3D, and will make it publicly available.
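The core idea, separating keypoint localization from joint identification via self-attention, can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's architecture: all shapes, weight matrices, and the single-layer attention are hypothetical stand-ins, and real implementations would use learned weights and multiple transformer layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: K detected 2D keypoints, each with a D-dim CNN feature,
# to be associated with one of J joint identities (e.g. 21 joints x 2 hands).
K, D, J = 8, 16, 42

feats = rng.standard_normal((K, D))  # CNN features sampled at keypoint locations

# One self-attention layer with illustrative random weights: keypoints
# exchange information so each feature becomes context-aware.
Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))
Q, Kmat, V = feats @ Wq, feats @ Wk, feats @ Wv
attn = softmax(Q @ Kmat.T / np.sqrt(D), axis=-1)  # (K, K) attention weights
ctx = attn @ V                                    # contextualized features

# Per-keypoint classification into joint identities (hypothetical head).
Wcls = rng.standard_normal((D, J))
logits = ctx @ Wcls          # (K, J)
joint_ids = logits.argmax(axis=-1)  # predicted joint identity per keypoint
```

The point of the sketch is the decoupling: the CNN only needs to say *where* keypoints are, while the attention stage, which sees all keypoints jointly, decides *which* joint each one is, a decision that is much easier with global context under heavy occlusion.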