Holistic, Instance-Level Human Parsing

Object parsing -- the task of decomposing an object into its semantic parts-- has traditionally been formulated as a category-level segmentation problem.Consequently, when there are multiple objects in an image, current methodscannot count the number of objects in the scene, nor can they determine whichpart belongs to which object. We address this problem by segmenting the partsof objects at an instance-level, such that each pixel in the image is assigneda part label, as well as the identity of the object it belongs to. Moreover, weshow how this approach benefits us in obtaining segmentations at coarsergranularities as well. Our proposed network is trained end-to-end givendetections, and begins with a category-level segmentation module. Thereafter, adifferentiable Conditional Random Field, defined over a variable number ofinstances for every input image, reasons about the identity of each part byassociating it with a human detection. In contrast to other approaches, ourmethod can handle the varying number of people in each image and our holisticnetwork produces state-of-the-art results in instance-level part and humansegmentation, together with competitive results in category-level partsegmentation, all achieved by a single forward-pass through our neural network.