A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence

Text-to-image diffusion models have made significant advances in generating and editing high-quality images. As a result, numerous approaches have explored the ability of diffusion model features to understand and process single images for downstream tasks, e.g., classification, semantic segmentation, and stylization. However, significantly less is known about what these features reveal across multiple, different images and objects. In this work, we exploit Stable Diffusion (SD) features for semantic and dense correspondence and discover that, with simple post-processing, SD features can perform quantitatively similarly to SOTA representations. Interestingly, the qualitative analysis reveals that SD features have very different properties compared to existing representation learning features, such as the recently released DINOv2: while DINOv2 provides sparse but accurate matches, SD features provide high-quality spatial information but sometimes inaccurate semantic matches. We demonstrate that a simple fusion of these two features works surprisingly well, and that a zero-shot evaluation using nearest neighbors on these fused features yields a significant performance gain over state-of-the-art methods on benchmark datasets, e.g., SPair-71k, PF-Pascal, and TSS. We also show that these correspondences enable interesting applications such as instance swapping in two images.
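The fusion-and-matching idea described above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact pipeline: the function names, the per-channel L2 normalization, and the `alpha` weighting between the two feature types are all assumptions.

```python
import numpy as np

def fuse_features(f_sd, f_dino, alpha=0.5):
    """Fuse SD and DINOv2 descriptors by L2-normalizing each set along
    the channel axis and concatenating them.
    f_sd:   (N, C1) array of per-patch SD descriptors.
    f_dino: (N, C2) array of per-patch DINOv2 descriptors.
    alpha:  relative weight of the SD features (hypothetical parameter).
    """
    f_sd = f_sd / (np.linalg.norm(f_sd, axis=1, keepdims=True) + 1e-8)
    f_dino = f_dino / (np.linalg.norm(f_dino, axis=1, keepdims=True) + 1e-8)
    return np.concatenate([alpha * f_sd, (1.0 - alpha) * f_dino], axis=1)

def nearest_neighbor_match(src, tgt):
    """Zero-shot matching: for each source descriptor, return the index of
    the most similar target descriptor (dot-product similarity; descriptors
    are assumed to be consistently normalized)."""
    sim = src @ tgt.T          # (N_src, N_tgt) similarity matrix
    return sim.argmax(axis=1)  # best target patch for each source patch
```

In use, `f_sd` and `f_dino` would come from the SD backbone and DINOv2 respectively, extracted on a shared spatial grid for each image, after which `nearest_neighbor_match(fused_src, fused_tgt)` yields dense correspondences without any training.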