SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion

A long-standing goal of 3D human reconstruction is to create lifelike and fully detailed 3D humans from single-view images. The main challenge lies in inferring unknown body shapes, appearances, and clothing details in areas not visible in the images. To address this, we propose SiTH, a novel pipeline that uniquely integrates an image-conditioned diffusion model into a 3D mesh reconstruction workflow. At the core of our method lies the decomposition of the challenging single-view reconstruction problem into generative hallucination and reconstruction subproblems. For the former, we employ a powerful generative diffusion model to hallucinate the unseen back-view appearance based on the input images. For the latter, we leverage skinned body meshes as guidance to recover full-body textured meshes from the input and back-view images. SiTH requires as few as 500 3D human scans for training while maintaining its generality and robustness to diverse images. Extensive evaluations on two 3D human benchmarks, including our newly created one, highlight our method's superior accuracy and perceptual quality in 3D textured human reconstruction. Our code and evaluation benchmark are available at https://ait.ethz.ch/sith
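The two-stage decomposition summarized above can be pictured as a small pipeline: a diffusion model hallucinates the back view, and a reconstruction module fuses front and back views under body-mesh guidance. The sketch below is illustrative only; the class and function names (BackViewDiffusion, MeshReconstructor, sith_pipeline) are hypothetical placeholders, not the released SiTH API.

```python
from dataclasses import dataclass


@dataclass
class TexturedMesh:
    vertices: list    # placeholder for mesh geometry
    texture: object   # placeholder for the recovered texture map


class BackViewDiffusion:
    """Stage 1 (hypothetical interface): image-conditioned diffusion model
    that hallucinates the unseen back-view appearance from the front image."""

    def sample(self, front_image):
        raise NotImplementedError("plug in a trained diffusion model here")


class MeshReconstructor:
    """Stage 2 (hypothetical interface): recovers a full-body textured mesh
    from the front and hallucinated back views, guided by a skinned body mesh
    (e.g. SMPL-X)."""

    def reconstruct(self, front_image, back_image, body_mesh) -> TexturedMesh:
        raise NotImplementedError("plug in a reconstruction network here")


def sith_pipeline(front_image, body_mesh,
                  generator: BackViewDiffusion,
                  reconstructor: MeshReconstructor) -> TexturedMesh:
    # Generative hallucination: predict the unseen back-view appearance.
    back_image = generator.sample(front_image)
    # Reconstruction: fuse front and back views under body-mesh guidance.
    return reconstructor.reconstruct(front_image, back_image, body_mesh)
```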