HOISDF: Constraining 3D Hand-Object Pose Estimation with Global Signed Distance Fields

Human hands are highly articulated and versatile at handling objects. Jointly estimating the 3D poses of a hand and the object it manipulates from a monocular camera is challenging due to frequent occlusions. Thus, existing methods often rely on intermediate 3D shape representations to increase performance. These representations are typically explicit, such as 3D point clouds or meshes, and thus provide information only in the direct surroundings of the intermediate hand pose estimate. To address this, we introduce HOISDF, a Signed Distance Field (SDF) guided hand-object pose estimation network, which jointly exploits hand and object SDFs to provide a global, implicit representation over the complete reconstruction volume. Specifically, the role of the SDFs is threefold: equip the visual encoder with implicit shape information, help to encode hand-object interactions, and guide the hand and object pose regression via SDF-based sampling and by augmenting the feature representations. We show that HOISDF achieves state-of-the-art results on hand-object pose estimation benchmarks (DexYCB and HO3Dv2). Code is available at https://github.com/amathislab/HOISDF
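
To make the notion of a global signed distance field and SDF-based sampling concrete, the following is a minimal toy sketch and not the HOISDF implementation. It assumes analytic spheres as stand-ins for the hand and the object, and the names `sphere_sdf`, `sample_near_surface`, `hand_sdf`, `object_sdf`, and the band width of 0.05 are all hypothetical choices made for illustration. In the actual method the SDFs would presumably be predicted by a network rather than given in closed form.

```python
import math
import random


def sphere_sdf(point, center, radius):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return math.dist(point, center) - radius


def sample_near_surface(sdf, n_candidates, band, bounds=(-1.0, 1.0), seed=0):
    """Draw uniform random points in a cube and keep those with |sdf| < band."""
    rng = random.Random(seed)
    lo, hi = bounds
    kept = []
    for _ in range(n_candidates):
        p = (rng.uniform(lo, hi), rng.uniform(lo, hi), rng.uniform(lo, hi))
        if abs(sdf(p)) < band:
            kept.append(p)
    return kept


# Toy stand-ins for the hand and object shapes (assumptions, not from the paper).
def hand_sdf(p):
    return sphere_sdf(p, (-0.3, 0.0, 0.0), 0.4)


def object_sdf(p):
    return sphere_sdf(p, (0.3, 0.0, 0.0), 0.4)


# The field is defined everywhere in the volume, so any query point can be scored.
hand_points = sample_near_surface(hand_sdf, n_candidates=5000, band=0.05)

# Points close to both surfaces are candidate hand-object interaction regions.
contact_points = [p for p in hand_points if abs(object_sdf(p)) < 0.05]
```

Unlike a point cloud or mesh, the two fields here can be evaluated at any location in the sampling cube, which is the property the abstract refers to as a global, implicit representation.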