K-means for unsupervised instance segmentation using a self-supervised transformer
Instance segmentation is a fundamental task in computer vision that assigns every pixel to an appropriate class and localizes objects into bounding boxes. However, collecting pixel-level segmentation labels is more resource- and time-consuming than collecting classification and detection labels. Herein, we present a novel approach, iterative mask refinement using a self-supervised transformer (IMST), which performs class-agnostic unsupervised instance segmentation using simple K-means clustering and a self-supervised vision transformer. IMST generates pseudo-ground-truth labels that can be used to train an off-the-shelf instance segmentation model. The pseudo labels demonstrate improved performance on multiple datasets. The instance segmentation model trained on the pseudo labels outperforms state-of-the-art unsupervised instance segmentation methods on COCO20k (+4.0 average precision (AP)) and COCO val2017 (+2.6 AP) without modifications to the training loss or architecture. We demonstrate that our method can be extended to tasks such as single/multiple object discovery and supervised fine-tuning instance segmentation while outperforming previous methods.
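To make the clustering ingredient concrete, the following is a minimal sketch of K-means applied to a grid of patch features, yielding one binary mask per cluster. It illustrates only the generic clustering step and is not the IMST pipeline: the iterative mask refinement is not shown, and the function names `kmeans` and `patch_masks`, the cluster count `k`, and the plain NumPy implementation are assumptions made for illustration. In practice, `patch_features` would presumably hold patch embeddings from a self-supervised vision transformer.

```python
import numpy as np


def kmeans(features, k, n_iters=20, seed=0):
    """Plain Lloyd's K-means on an (N, D) array; returns (labels, centroids)."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points (fancy indexing copies).
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iters):
        # Squared Euclidean distance from every point to every centroid: (N, k).
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids


def patch_masks(patch_features, grid_hw, k):
    """Cluster (H*W, D) patch features and return k boolean (H, W) masks."""
    h, w = grid_hw
    labels, _ = kmeans(patch_features, k)
    label_map = labels.reshape(h, w)
    # One binary mask per cluster; every patch belongs to exactly one mask.
    return np.stack([label_map == j for j in range(k)])
```

Calling `patch_masks(features, (H, W), k)` on an `(H*W, D)` feature array returns a `(k, H, W)` boolean array. The masks live on the patch grid, so they would still need to be upsampled to pixel resolution before serving as pseudo labels.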