Interacting Hand-Object Pose Estimation via Dense Mutual Attention

3D hand-object pose estimation is the key to the success of many computer vision applications. The main focus of this task is to effectively model the interaction between the hand and an object. To this end, existing works either rely on interaction constraints in a computationally expensive iterative optimization, or consider only a sparse correlation between sampled hand and object keypoints. In contrast, we propose a novel dense mutual attention mechanism that is able to model fine-grained dependencies between the hand and the object. Specifically, we first construct the hand and object graphs according to their mesh structures. For each hand node, we aggregate features from every object node by the learned attention, and vice versa for each object node. Thanks to such dense mutual attention, our method is able to produce physically plausible poses with high quality and real-time inference speed. Extensive quantitative and qualitative experiments on large benchmark datasets show that our method outperforms state-of-the-art methods. The code is available at https://github.com/rongakowang/DenseMutualAttention.git.
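To make the dense mutual attention idea concrete, the following is a minimal NumPy sketch of the core operation described above: every hand node attends to all object nodes and vice versa, with learned projection matrices. All names, shapes, and the residual-update form are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dense_mutual_attention(hand, obj, Wq_h, Wk_o, Wv_o, Wq_o, Wk_h, Wv_h):
    """Sketch of dense mutual attention between graph node features.

    hand: (N_h, d) hand node features; obj: (N_o, d) object node features.
    The six (d, d) weight matrices are hypothetical learned projections.
    """
    d = hand.shape[-1]
    scale = 1.0 / np.sqrt(d)
    # Each hand node aggregates features from EVERY object node
    # (dense, not restricted to sparse keypoint pairs).
    attn_ho = softmax((hand @ Wq_h) @ (obj @ Wk_o).T * scale)
    hand_out = hand + attn_ho @ (obj @ Wv_o)
    # Symmetrically, each object node aggregates from every hand node.
    attn_oh = softmax((obj @ Wq_o) @ (hand @ Wk_h).T * scale)
    obj_out = obj + attn_oh @ (hand @ Wv_h)
    return hand_out, obj_out

# Toy usage: 5 hand nodes, 8 object nodes, 16-dim features.
rng = np.random.default_rng(0)
d = 16
hand = rng.normal(size=(5, d))
obj = rng.normal(size=(8, d))
weights = [rng.normal(scale=0.1, size=(d, d)) for _ in range(6)]
hand_out, obj_out = dense_mutual_attention(hand, obj, *weights)
```

Because each attention row spans all nodes on the other side, gradients couple every hand-object node pair, which is what allows fine-grained dependencies to be learned in a single forward pass rather than through iterative optimization.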