Scene Synthesis from Human Motion

Large-scale capture of human motion with diverse, complex scenes, while immensely useful, is often considered prohibitively costly. Meanwhile, human motion alone contains rich information about the scenes humans reside in and interact with. For example, a sitting human suggests the existence of a chair, and their leg position further implies the chair's pose. In this paper, we propose to synthesize diverse, semantically reasonable, and physically plausible scenes based on human motion. Our framework, Scene Synthesis from HUMan MotiON (SUMMON), consists of two steps. It first uses ContactFormer, our newly introduced contact predictor, to obtain temporally consistent contact labels from human motion. Based on these predictions, SUMMON then selects interacting objects and optimizes physical plausibility losses; it further populates the scene with objects that do not interact with humans. Experimental results demonstrate that SUMMON synthesizes feasible, plausible, and diverse scenes and has the potential to generate extensive human-scene interaction data for the community.