GenHMR: Generative Human Mesh Recovery

Human mesh recovery (HMR) is crucial in many computer vision applications, from health to arts and entertainment. HMR from monocular images has predominantly been addressed by deterministic methods that output a single prediction for a given 2D image. However, HMR from a single image is an ill-posed problem due to depth ambiguity and occlusions. Probabilistic methods have attempted to address this by generating and fusing multiple plausible 3D reconstructions, but their performance has often lagged behind deterministic approaches. In this paper, we introduce GenHMR, a novel generative framework that reformulates monocular HMR as an image-conditioned generative task, explicitly modeling and mitigating uncertainties in the 2D-to-3D mapping process. GenHMR comprises two key components: (1) a pose tokenizer that converts 3D human poses into a sequence of discrete tokens in a latent space, and (2) an image-conditional masked transformer that learns the probabilistic distributions of the pose tokens, conditioned on the input image prompt and a randomly masked token sequence. During inference, the model samples from the learned conditional distribution to iteratively decode high-confidence pose tokens, thereby reducing 3D reconstruction uncertainties. To further refine the reconstruction, a 2D pose-guided refinement technique is proposed to directly fine-tune the decoded pose tokens in the latent space, which forces the projected 3D body mesh to align with the 2D pose clues. Experiments on benchmark datasets demonstrate that GenHMR significantly outperforms state-of-the-art methods. The project website can be found at https://m-usamasaleem.github.io/publication/GenHMR/GenHMR.html
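
The iterative decoding of high-confidence pose tokens described above can be illustrated with a minimal sketch. It is not the GenHMR implementation: the cosine unmasking schedule, the `MASK` sentinel, and the names `iterative_decode` and `toy_predictor` are assumptions made for illustration, and the toy predictor returns random logits in place of the image-conditional masked transformer. In a real system the predictor would also take the image features as input.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a position that is still masked


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def iterative_decode(predict_logits, seq_len, vocab_size, steps=4, seed=0):
    """Confidence-based iterative decoding of a masked token sequence.

    predict_logits(tokens, rng) must return one list of vocab_size logits
    per position. It stands in for the image-conditioned transformer.
    """
    rng = random.Random(seed)
    tokens = [MASK] * seq_len  # start from a fully masked sequence
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        logits = predict_logits(tokens, rng)
        # Sample a candidate token per masked position and keep its
        # probability as the confidence score.
        candidates = {}
        for i in masked:
            probs = softmax(logits[i])
            tok = rng.choices(range(vocab_size), weights=probs, k=1)[0]
            candidates[i] = (tok, probs[tok])
        # Assumed cosine schedule: how many positions stay masked after
        # this step. It reaches zero at the final step.
        keep_masked = math.floor(seq_len * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit only the most confident candidates; the rest stay masked
        # and are re-predicted in the next iteration.
        ranked = sorted(masked, key=lambda i: candidates[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = candidates[i][0]
    return tokens


def toy_predictor(tokens, rng, vocab_size=16):
    """Hypothetical stand-in model: random logits for every position."""
    return [[rng.gauss(0.0, 1.0) for _ in range(vocab_size)] for _ in tokens]


decoded = iterative_decode(toy_predictor, seq_len=8, vocab_size=16)
```

Each pass commits only the most confident sampled tokens and leaves the remainder masked, so later predictions can condition on earlier, more certain ones. The decoded token ids would then be mapped back to a 3D pose by the tokenizer's decoder, which this sketch omits.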