HMP: Hand Motion Priors for Pose and Shape Estimation from Video

Understanding how humans interact with the world requires accurate 3D hand pose estimation, a task complicated by the hand's high degree of articulation, frequent occlusions, self-occlusions, and rapid motions. While most existing methods rely on single-image inputs, videos offer useful cues to address these issues. However, existing video-based 3D hand datasets are insufficient for training feedforward models that generalize to in-the-wild scenarios. On the other hand, large human motion capture datasets, e.g. AMASS, also include hand motions. We therefore develop a generative motion prior specific to hands, trained on the AMASS dataset, which features diverse and high-quality hand motions. This motion prior is then employed for video-based 3D hand motion estimation via latent optimization. Integrating a robust motion prior significantly improves performance, especially in occluded scenarios, producing stable, temporally consistent results that surpass conventional single-frame methods. We demonstrate our method's efficacy through qualitative and quantitative evaluations on the HO3D and DexYCB datasets, with special emphasis on an occlusion-focused subset of HO3D. Code is available at https://hmp.is.tue.mpg.de
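The latent-optimization idea above can be illustrated with a minimal, self-contained sketch: a frozen "motion prior decoder" maps a latent code to a hand motion, and we fit the latent by gradient descent on a data term plus a prior term. The decoder here is a toy linear map, and all names (`decode`, `fit_latent`) are illustrative stand-ins, not the paper's actual networks or losses.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM, POSE_DIM = 8, 30  # toy dimensions for illustration
W = rng.standard_normal((POSE_DIM, LATENT_DIM))  # frozen toy "decoder" weights

def decode(z):
    """Map a latent code to a pose vector (stand-in for the motion prior decoder)."""
    return W @ z

def fit_latent(target_pose, steps=200, lr=0.01, reg=1e-3):
    """Optimize z for: 0.5*||decode(z) - target||^2 + 0.5*reg*||z||^2.

    The regularizer plays the role of the prior term, keeping z near the
    latent distribution the prior was trained with.
    """
    z = np.zeros(LATENT_DIM)
    for _ in range(steps):
        residual = decode(z) - target_pose   # data-term residual
        grad = W.T @ residual + reg * z      # analytic gradient of the objective
        z -= lr * grad
    return z

# Simulate a noisy "observation" generated from a ground-truth latent.
z_true = rng.standard_normal(LATENT_DIM)
observed = decode(z_true) + 0.05 * rng.standard_normal(POSE_DIM)

z_est = fit_latent(observed)
err = np.linalg.norm(decode(z_est) - observed)
```

In the actual method the decoder is a learned network, the data term measures agreement with per-frame image evidence, and the optimization runs over a whole motion sequence; the sketch only shows the structure of fitting a latent against a frozen prior.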