Capturing and Inferring Dense Full-Body Human-Scene Contact

Inferring human-scene contact (HSC) is the first step toward understanding how humans interact with their surroundings. While detecting 2D human-object interaction (HOI) and reconstructing 3D human pose and shape (HPS) have enjoyed significant progress, reasoning about 3D human-scene contact from a single image is still challenging. Existing HSC detection methods consider only a few types of predefined contact, often reduce body and scene to a small number of primitives, and even overlook image evidence. To predict human-scene contact from a single image, we address the limitations above from both data and algorithmic perspectives. We capture a new dataset called RICH for "Real scenes, Interaction, Contact and Humans." RICH contains multiview outdoor/indoor video sequences at 4K resolution, ground-truth 3D human bodies captured using markerless motion capture, 3D body scans, and high-resolution 3D scene scans. A key feature of RICH is that it also contains accurate vertex-level contact labels on the body. Using RICH, we train a network that predicts dense body-scene contacts from a single RGB image. Our key insight is that regions in contact are always occluded, so the network needs the ability to explore the whole image for evidence. We use a transformer to learn such non-local relationships and propose a new Body-Scene contact TRansfOrmer (BSTRO). Very few methods explore 3D contact; those that do focus on the feet only, detect foot contact as a post-processing step, or infer contact from body pose without looking at the scene. To our knowledge, BSTRO is the first method to directly estimate 3D body-scene contact from a single image. We demonstrate that BSTRO significantly outperforms the prior art. The code and dataset are available at https://rich.is.tue.mpg.de.