Single-stage Multi-human Parsing via Point Sets and Center-based Offsets

This work studies the multi-human parsing problem. Existing methods, whether following top-down or bottom-up two-stage paradigms, usually incur expensive computational costs. We instead present a high-performance Single-stage Multi-human Parsing (SMP) deep architecture that decouples the multi-human parsing problem into two fine-grained sub-problems, i.e., locating the human body and locating the parts. SMP leverages the point features at the barycenter positions to obtain their segmentation and then generates a series of offsets from the barycenter of the human body to the barycenters of its parts, thus matching human bodies and parts without a grouping process. Within the SMP architecture, we propose a Refined Feature Retain module to extract the global feature of instances through generated mask attention, and a Mask of Interest Reclassify module, a trainable plug-in module, to refine the classification results with the predicted segmentation. Extensive experiments on the MHPv2.0 dataset demonstrate the superior effectiveness and efficiency of the proposed method, which surpasses the state-of-the-art method by 2.1% in AP50p, 1.0% in APvolp, and 1.2% in PCP50. In particular, the proposed method requires fewer training epochs and a less complex model architecture. We will release our source code, pretrained models, and online demos to facilitate further studies.
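
The center-based offset idea can be illustrated with a small NumPy sketch. It is not the SMP implementation: the helper names `barycenter` and `match_parts_to_bodies`, the toy masks, and the nearest-neighbor assignment rule are assumptions made for illustration, and the offsets are derived from ground-truth masks as a stand-in for the offsets a network would predict.

```python
import numpy as np

def barycenter(mask):
    """Return the (row, col) centroid of a non-empty binary mask."""
    rows, cols = np.nonzero(mask)
    return np.array([rows.mean(), cols.mean()])

def match_parts_to_bodies(body_centers, offsets, part_centers):
    """Assign each part to the body whose center + offset lands closest to it.

    body_centers: (B, 2) body barycenters.
    offsets:      (B, 2) body-to-part offsets for one part class (hypothetical
                  stand-in for network predictions).
    part_centers: (P, 2) barycenters of detected parts of that class.
    Returns an array of length P holding the matched body index per part.
    """
    expected = body_centers + offsets  # where each body expects its part
    diffs = part_centers[:, None, :] - expected[None, :, :]  # (P, B, 2)
    dists = np.linalg.norm(diffs, axis=-1)                   # (P, B)
    return dists.argmin(axis=1)

# Toy scene: two people side by side on a 4x6 canvas, each with a "head" part.
canvas = (4, 6)
body_a = np.zeros(canvas, dtype=bool); body_a[0:4, 0:2] = True
body_b = np.zeros(canvas, dtype=bool); body_b[0:4, 4:6] = True
head_a = np.zeros(canvas, dtype=bool); head_a[0:2, 0:2] = True
head_b = np.zeros(canvas, dtype=bool); head_b[0:2, 4:6] = True

body_centers = np.stack([barycenter(body_a), barycenter(body_b)])
# Assumption: offsets come from ground truth here; a model would regress them.
offsets = np.stack([barycenter(head_a), barycenter(head_b)]) - body_centers

# Parts arrive in arbitrary order (head_b first); offsets recover ownership.
part_centers = np.stack([barycenter(head_b), barycenter(head_a)])
assignment = match_parts_to_bodies(body_centers, offsets, part_centers)
print(assignment)
```

Because each body center carries its own offset to the part, ownership is resolved by a single distance lookup per part rather than by a separate grouping stage, which is the property the abstract attributes to the single-stage design.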