DiscoBox: Weakly Supervised Instance Segmentation and Semantic Correspondence from Box Supervision

We introduce DiscoBox, a novel framework that jointly learns instance segmentation and semantic correspondence using bounding box supervision. Specifically, we propose a self-ensembling framework where instance segmentation and semantic correspondence are jointly guided by a structured teacher in addition to the bounding box supervision. The teacher is a structured energy model incorporating a pairwise potential and a cross-image potential to model the pairwise pixel relationships both within and across the boxes. Minimizing the teacher energy simultaneously yields refined object masks and dense correspondences between intra-class objects, which are taken as pseudo-labels to supervise the task network and provide positive/negative correspondence pairs for dense contrastive learning. We show a symbiotic relationship where the two tasks mutually benefit from each other. Our best model achieves 37.9% AP on COCO instance segmentation, surpassing prior weakly supervised methods and remaining competitive with supervised methods. We also obtain state-of-the-art weakly supervised results on PASCAL VOC12 and PF-PASCAL with real-time inference.
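
To make the role of the positive/negative correspondence pairs concrete, the following is a minimal sketch of a contrastive loss for a single anchor pixel feature. The abstract does not specify the loss form, so the InfoNCE-style formulation, the cosine similarity, the temperature value, and the names `cosine` and `info_nce` are all illustrative assumptions rather than the method used in DiscoBox.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor pixel feature (assumed form).

    anchor    -- feature of a pixel in one object
    positive  -- feature of the pixel the teacher matched to the anchor
    negatives -- features of pixels the teacher did not match to the anchor
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy 2-D features (hypothetical values, for illustration only).
anchor = [1.0, 0.0]
matched = [0.9, 0.1]
unmatched = [[0.0, 1.0], [-1.0, 0.0]]

# Loss when the teacher's positive pair agrees with the feature similarity.
loss_good = info_nce(anchor, matched, unmatched)
# Loss when a dissimilar pixel is (wrongly) labeled as the positive.
loss_bad = info_nce(anchor, unmatched[0], [matched, unmatched[1]])
```

In this toy setting the loss is small when the designated positive is the most similar feature and large otherwise, which is the pressure that pulls corresponding pixels together and pushes non-corresponding pixels apart.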