Real-time Human-Centric Segmentation for Complex Video Scenes

Most existing video tasks related to "human" focus on the segmentation of salient humans, ignoring the unspecified others in the video. Few studies have focused on segmenting and tracking all humans in a complex video, including pedestrians and humans in other states (e.g., seated, riding, or occluded). In this paper, we propose a novel framework, abbreviated as HVISNet, that segments and tracks all people present in given videos based on a one-stage detector. To better evaluate complex scenes, we offer a new benchmark called HVIS (Human Video Instance Segmentation), which comprises 1447 human instance masks in 805 high-resolution videos covering diverse scenes. Extensive experiments show that our proposed HVISNet outperforms state-of-the-art methods in terms of accuracy at a real-time inference speed (30 FPS), especially on complex video scenes. We also observe that using the center of the bounding box to distinguish different individuals severely deteriorates segmentation accuracy, especially under heavy occlusion. We refer to this common phenomenon as the ambiguous positive samples problem. To alleviate this problem, we propose a mechanism named Inner Center Sampling to improve the accuracy of instance segmentation. This plug-and-play inner center sampling mechanism can be incorporated into any instance segmentation model based on a one-stage detector to improve performance. In particular, it gains a 4.1 mAP improvement over the state-of-the-art method in the case of occluded humans. Code and data are available at https://github.com/IIGROUP/HVISNet.
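
The intuition behind Inner Center Sampling can be conveyed with a short sketch. The snippet below is a minimal illustration, not the authors' implementation: it contrasts the naive box-center positive sample, which can fall off the target instance (e.g., onto an occluding person), with a sample point constrained to lie inside the instance mask. The helper names and the specific selection rule (the in-mask pixel closest to the mask centroid) are illustrative assumptions.

```python
import numpy as np

def box_center(box):
    """Naive positive-sample location: the geometric center of the bounding box.

    Under heavy occlusion or for concave shapes, this point may not lie on
    the target instance at all, producing an ambiguous positive sample.
    """
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0

def inner_center(mask):
    """Positive-sample location constrained to the instance itself.

    Returns the in-mask pixel closest to the mask centroid, so the sample is
    guaranteed to lie on the target instance. `mask` is a binary (H, W) array;
    this exact selection rule is an assumption for illustration.
    """
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()            # mask centroid (may lie off-mask)
    d2 = (ys - cy) ** 2 + (xs - cx) ** 2     # squared distance to the centroid
    i = np.argmin(d2)
    return float(xs[i]), float(ys[i])

# Toy example: a C-shaped (concave) mask whose box center falls off the mask.
mask = np.zeros((9, 9), dtype=np.uint8)
mask[1:8, 1:3] = 1   # left bar
mask[1:3, 3:8] = 1   # top bar
mask[6:8, 3:8] = 1   # bottom bar
box = (1, 1, 7, 7)

print(box_center(box))     # (4.0, 4.0) -> inside the "hole", off the instance
print(inner_center(mask))  # a point guaranteed to lie on the instance mask
```

In this toy case the box center lands in the concave gap, exactly the kind of ambiguous positive sample the abstract describes; the in-mask sample avoids it by construction.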