GraPix: Exploring Graph Modularity Optimization for Unsupervised Pixel Clustering
A vision transformer learns high-quality patch embeddings during self-supervised training, and these embeddings play a crucial role in many unsupervised downstream tasks such as object localization, object detection, and sparse semantic segmentation. These tasks exploit various properties of the patch affinity graph to achieve state-of-the-art performance in an unsupervised setting. However, the full potential of the patch affinity graph is yet to be harnessed for dense semantic segmentation. Existing works show that modularity is an essential property of a graph, reflecting the strength of its partitions. We argue that jointly optimizing feature clustering in the patch embedding space and graph modularity in the node attribute space leads to smooth training convergence and better results. In this paper, we propose GraPix, a novel end-to-end unsupervised learning method that exploits the latent structure of patch embeddings extracted from a self-supervised vision transformer for dense semantic segmentation. GraPix constructs an affinity graph based on patch similarities in the embedding space, and then learns highly discriminative centroid embeddings for dense semantic segmentation via our novel joint feature-clustering and graph-modularity optimization objective. Experimental results show that GraPix outperforms the state-of-the-art method on the SUIM dataset and achieves the second-best performance on the Cityscapes dataset. We also perform detailed ablations to justify the choice of model components and hyper-parameters. The code is available at https://github.com/SonalKumar95/GraPix.
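To make the two ingredients of the objective concrete, the following is a minimal sketch (not the authors' implementation) of how a patch affinity graph and its modularity can be computed. It assumes cosine similarity with a hard threshold for the affinity graph and the standard Newman modularity Q = (1/2m) Σ_ij (A_ij − k_i k_j / 2m) δ(c_i, c_j); the thresholding scheme and the exact modularity formulation used by GraPix may differ.

```python
import numpy as np

def affinity_graph(embeddings, threshold=0.5):
    """Build a binary patch affinity graph from cosine similarities.

    embeddings: (n_patches, dim) array of ViT patch embeddings.
    Edges connect patch pairs whose cosine similarity exceeds `threshold`
    (an illustrative choice, not the paper's construction).
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    adj = (sim > threshold).astype(float)
    np.fill_diagonal(adj, 0.0)  # no self-loops
    return adj

def modularity(adj, labels):
    """Newman modularity of a partition given by integer `labels`.

    Q = (1/2m) * sum_ij (A_ij - k_i * k_j / (2m)) * [c_i == c_j],
    where m is the number of edges and k_i the degree of node i.
    """
    m = adj.sum() / 2.0                      # total edge weight
    k = adj.sum(axis=1)                      # node degrees
    expected = np.outer(k, k) / (2.0 * m)    # null-model edge probabilities
    same = labels[:, None] == labels[None, :]
    return (adj - expected)[same].sum() / (2.0 * m)
```

For example, four patches forming two tight clusters in embedding space, labeled consistently with those clusters, yield a high modularity score; a clustering head trained to maximize this quantity (alongside a feature-clustering loss) is the kind of joint objective the abstract describes.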