Improving Image Clustering with Artifacts Attenuation via Inference-Time Attention Engineering

The goal of this paper is to improve the performance of pretrained Vision Transformer (ViT) models, particularly DINOv2, on the image clustering task without requiring re-training or fine-tuning. As model size increases, a high-norm artifact anomaly appears in the patch tokens of multi-head attention. We observe that this anomaly degrades accuracy in zero-shot image clustering. These artifacts are characterized by disproportionately large values in the attention map compared to the other patch tokens. To address them, we propose an approach called Inference-Time Attention Engineering (ITAE), which manipulates the attention function during inference. Specifically, we identify the artifacts by inspecting one of the Query-Key-Value (QKV) components in multi-head attention and attenuate their corresponding attention values inside the pretrained model. ITAE improves clustering accuracy on multiple datasets by producing more expressive features in the latent space. Our findings highlight the potential of ITAE as a practical solution for reducing artifacts in pretrained ViT models and improving performance on clustering tasks without the need for re-training or fine-tuning.
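
To make the mechanism concrete, here is a minimal PyTorch-style sketch of inference-time attention attenuation in a single ViT attention block. It identifies high-norm tokens via one QKV component (the key features here, as an assumption) and downweights the attention paid to them; the quantile threshold and attenuation factor are illustrative placeholders, not the paper's exact procedure or values.

```python
import torch

def itae_attention(q, k, v, norm_quantile=0.99, attenuation=0.1):
    """Sketch of ITAE-style attention attenuation at inference time.

    q, k, v: (batch, heads, tokens, head_dim) tensors from a pretrained
    ViT attention block. norm_quantile and attenuation are hypothetical
    hyperparameters chosen for illustration.
    """
    # Flag artifact patches by the norm of the key features (assumed
    # criterion): tokens whose norm exceeds a per-head quantile.
    key_norms = k.norm(dim=-1)                                   # (B, H, N)
    threshold = torch.quantile(key_norms, norm_quantile, dim=-1, keepdim=True)
    artifact = (key_norms > threshold).unsqueeze(-2)             # (B, H, 1, N)

    # Standard scaled dot-product attention weights.
    attn = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5
    attn = attn.softmax(dim=-1)                                  # (B, H, N, N)

    # Attenuate the columns belonging to artifact tokens, then
    # renormalize so each query's weights still sum to one.
    attn = torch.where(artifact, attn * attenuation, attn)
    attn = attn / attn.sum(dim=-1, keepdim=True)
    return attn @ v                                              # (B, H, N, head_dim)
```

In practice, a function like this would be patched over the attention forward of a frozen, pretrained model (e.g., via a forward hook), leaving all weights untouched, which is what makes the approach re-training-free.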