Leveraging Photometric Consistency over Time for Sparsely Supervised Hand-Object Reconstruction

Modeling hand-object manipulations is essential for understanding how humans interact with their environment. While of practical importance, estimating the pose of hands and objects during interactions is challenging due to the large mutual occlusions that occur during manipulation. Recent efforts have been directed towards fully-supervised methods that require large amounts of labeled training samples. Collecting 3D ground-truth data for hand-object interactions, however, is costly, tedious, and error-prone. To overcome this challenge we present a method to leverage photometric consistency across time when annotations are only available for a sparse subset of frames in a video. Our model is trained end-to-end on color images to jointly reconstruct hands and objects in 3D by inferring their poses. Given our estimated reconstructions, we differentiably render the optical flow between pairs of adjacent images and use it within the network to warp one frame to another. We then apply a self-supervised photometric loss that relies on the visual consistency between nearby images. We achieve state-of-the-art results on 3D hand-object reconstruction benchmarks and demonstrate that our approach allows us to improve the pose estimation accuracy by leveraging information from neighboring frames in low-data regimes.
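
To make the warping-and-photometric-loss step concrete, the following is a minimal sketch of how a flow-based photometric consistency term could be computed, assuming a PyTorch setting. The function names (`warp_with_flow`, `photometric_loss`), the pixel-unit flow convention, and the optional visibility mask are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(image, flow):
    """Warp `image` (N, C, H, W) using a dense flow field `flow` (N, 2, H, W)
    given in pixel units. Bilinear sampling; out-of-view pixels are zero-filled."""
    n, _, h, w = image.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=image.dtype, device=image.device),
        torch.arange(w, dtype=image.dtype, device=image.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).unsqueeze(0)   # (1, 2, H, W)
    coords = base + flow                               # displaced sampling locations
    # Normalize to [-1, 1] as expected by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)   # (N, H, W, 2)
    return F.grid_sample(image, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)

def photometric_loss(frame_t, frame_t1, flow_t_to_t1, valid_mask=None):
    """L1 photometric consistency between frame_t and its neighbor frame_t1
    warped back toward frame_t with the rendered flow. `valid_mask` (N, 1, H, W)
    is an assumed option to restrict the loss to rendered hand/object pixels."""
    warped = warp_with_flow(frame_t1, flow_t_to_t1)
    diff = (warped - frame_t).abs()
    if valid_mask is not None:
        return (diff * valid_mask).sum() / valid_mask.sum().clamp(min=1.0)
    return diff.mean()
```

Because both the flow rendering and the bilinear warp are differentiable, such a loss can propagate gradients from unlabeled neighboring frames back to the pose predictions, which is the mechanism the abstract describes for sparsely annotated videos.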