Occlusion-Aware Instance Segmentation via BiLayer Network Architectures

Segmenting highly overlapping image objects is challenging, because there is typically no distinction between real object contours and occlusion boundaries on images. Unlike previous instance segmentation methods, we model image formation as a composition of two overlapping layers, and propose the Bilayer Convolutional Network (BCNet), where the top layer detects occluding objects (occluders) and the bottom layer infers partially occluded instances (occludees). Explicitly modeling the occlusion relationship with a bilayer structure naturally decouples the boundaries of the occluding and occluded instances, and considers the interaction between them during mask regression. We investigate the efficacy of the bilayer structure using two popular convolutional network designs, namely, the Fully Convolutional Network (FCN) and the Graph Convolutional Network (GCN). Further, we formulate bilayer decoupling using the vision transformer (ViT), by representing instances in the image as separate learnable occluder and occludee queries. Extensive experiments on image instance segmentation benchmarks (COCO, KINS, COCOA) and video instance segmentation benchmarks (YTVIS, OVIS, BDD100K MOTS) show large and consistent improvements with one-stage, two-stage, and query-based object detectors across various backbones and network layer choices, especially for heavy occlusion cases, validating the generalization ability of bilayer decoupling. Code and data are available at https://github.com/lkeab/BCNet.
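
To make the two-layer view of image formation concrete, the following is a minimal toy sketch in plain Python. It is not the BCNet implementation: the function names `compose_layers` and `occlusion_ratio`, the binary-grid mask representation, and the example masks are all assumptions introduced here for illustration. It only shows how a top-layer occluder mask and a bottom-layer occludee mask combine into a visible region and an overlap region.

```python
# Toy illustration of two-layer image formation (hypothetical helpers, not BCNet code).
# Masks are small binary grids: 1 = object pixel, 0 = background.

def compose_layers(occluder, occludee):
    """Split the occludee into its visible part and the part hidden by the occluder.

    occluder: top-layer mask (the object in front).
    occludee: bottom-layer mask covering the full object extent, hidden pixels included.
    """
    visible = [[b & (1 - t) for t, b in zip(top_row, bottom_row)]
               for top_row, bottom_row in zip(occluder, occludee)]
    overlap = [[b & t for t, b in zip(top_row, bottom_row)]
               for top_row, bottom_row in zip(occluder, occludee)]
    return visible, overlap


def occlusion_ratio(occludee, overlap):
    """Fraction of the occludee's pixels that are covered by the occluder."""
    total = sum(map(sum, occludee))
    hidden = sum(map(sum, overlap))
    return hidden / total if total else 0.0


# Assumed example: the occluder covers the upper-right corner of the occludee.
occluder = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
occludee = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]

visible, overlap = compose_layers(occluder, occludee)
ratio = occlusion_ratio(occludee, overlap)
```

In this toy setting, the edge between the `visible` and `overlap` regions is an occlusion boundary rather than a real contour of the occludee, which is the ambiguity the bilayer structure is meant to decouple.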