Sapiens: Foundation for Human Vision Models

We present Sapiens, a family of models for four fundamental human-centric vision tasks - 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Our models natively support 1K high-resolution inference and are extremely easy to adapt for individual tasks by simply fine-tuning models pretrained on over 300 million in-the-wild human images. We observe that, given the same computational budget, self-supervised pretraining on a curated dataset of human images significantly boosts the performance for a diverse set of human-centric tasks. The resulting models exhibit remarkable generalization to in-the-wild data, even when labeled data is scarce or entirely synthetic. Our simple model design also brings scalability - model performance across tasks improves as we scale the number of parameters from 0.3 to 2 billion. Sapiens consistently surpasses existing baselines across various human-centric benchmarks. We achieve significant improvements over the prior state-of-the-art on Humans-5K (pose) by 7.6 mAP, Humans-2K (part-seg) by 17.1 mIoU, Hi4D (depth) by 22.4% relative RMSE, and THuman2 (normal) by 53.5% relative angular error.