Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot

We present Multi-HMR, a strong single-shot model for multi-person 3D human mesh recovery from a single RGB image. Predictions encompass the whole body, i.e., including hands and facial expressions, using the SMPL-X parametric model, as well as 3D location in the camera coordinate system. Our model detects people by predicting coarse 2D heatmaps of person locations, using features produced by a standard Vision Transformer (ViT) backbone. It then predicts their whole-body pose, shape and 3D location using a new cross-attention module called the Human Prediction Head (HPH), with one query attending to the entire set of features for each detected person. Because direct prediction of fine-grained hand and facial poses in a single shot, i.e., without relying on explicit crops around body parts, is hard to learn from existing data, we introduce CUFFS, the Close-Up Frames of Full-Body Subjects dataset, containing humans close to the camera with diverse hand poses. We show that incorporating it into the training data further enhances predictions, particularly for hands. Multi-HMR also optionally accounts for camera intrinsics, if available, by encoding camera ray directions for each image token. This simple design achieves strong performance on whole-body and body-only benchmarks simultaneously: a ViT-S backbone on $448{\times}448$ images already yields a fast and competitive model, while larger models and higher resolutions obtain state-of-the-art results.
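To make the detection-then-regression design concrete, here is a minimal PyTorch sketch of the described pipeline: a coarse per-token person heatmap over ViT features, followed by a cross-attention head with one query per detected person attending to all image tokens. The class name `HumanPredictionHead`, the threshold, the output dimension, and the feature shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HumanPredictionHead(nn.Module):
    """One cross-attention query per detected person over all image tokens."""
    def __init__(self, dim: int = 384, num_heads: int = 8, out_dim: int = 179):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, out_dim))

    def forward(self, person_queries, image_tokens):
        # person_queries: (B, P, dim) tokens taken at detected heatmap peaks
        # image_tokens:   (B, N, dim) full set of ViT patch features
        attended, _ = self.cross_attn(person_queries, image_tokens, image_tokens)
        # out_dim is a placeholder for SMPL-X parameters plus a 3D location
        return self.mlp(attended)                  # (B, P, out_dim)

# Detection side: a coarse per-patch heatmap of person presence.
dim, B, N = 384, 1, (448 // 14) ** 2               # e.g. ViT-S on 448x448, 14px patches
image_tokens = torch.randn(B, N, dim)              # stand-in for backbone features
heatmap_logits = nn.Linear(dim, 1)(image_tokens).squeeze(-1)   # (B, N)
keep = heatmap_logits.sigmoid()[0] > 0.5           # thresholded detections (assumed)
person_queries = image_tokens[:, keep]             # (B, P, dim)

head = HumanPredictionHead(dim)
params = head(person_queries, image_tokens)        # whole-body params per person
```

The key design point echoed here is that each person query sees the entire feature map, so the head can regress fine-grained hand and face parameters without explicit body-part crops.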
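The optional camera-intrinsics conditioning can likewise be sketched: each image token is associated with the direction of the camera ray through its patch center, back-projected with the inverse intrinsics. How these rays are embedded and fused with the tokens is an assumption here; the sketch only shows the ray computation itself.

```python
import torch

def ray_directions(K: torch.Tensor, img_size: int, patch: int) -> torch.Tensor:
    """Unit camera-ray direction per patch center, given 3x3 intrinsics K."""
    n = img_size // patch
    centers = (torch.arange(n, dtype=K.dtype) + 0.5) * patch   # patch-center pixels
    v, u = torch.meshgrid(centers, centers, indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)      # (n, n, 3) homogeneous
    rays = pix @ torch.linalg.inv(K).T                         # back-project: K^{-1} [u, v, 1]^T
    rays = rays / rays.norm(dim=-1, keepdim=True)              # normalize to unit length
    return rays.reshape(-1, 3)                                 # (N, 3), one per token

# Hypothetical intrinsics for a 448x448 image; rays would then be encoded
# (e.g., via a learned or Fourier embedding) and added to the token features.
K = torch.tensor([[600.0, 0.0, 224.0], [0.0, 600.0, 224.0], [0.0, 0.0, 1.0]])
rays = ray_directions(K, img_size=448, patch=14)
```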