CleanDIFT: Diffusion Features without Noise

Internal features from large-scale pre-trained diffusion models have recently been established as powerful semantic descriptors for a wide range of downstream tasks. Works that use these features generally need to add noise to images before passing them through the model to obtain the semantic features, as the models do not offer the most useful features when given images with little to no noise. We show that this noise has a critical impact on the usefulness of these features that cannot be remedied by ensembling with different random noises. We address this issue by introducing a lightweight, unsupervised fine-tuning method that enables diffusion backbones to provide high-quality, noise-free semantic features. We show that these features readily outperform previous diffusion features by a wide margin across a broad variety of extraction setups and downstream tasks, offering better performance than even ensemble-based methods at a fraction of the cost.
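As background, the noising step the abstract refers to is the standard DDPM forward process: a clean image (or latent) x0 is mixed with Gaussian noise according to the model's schedule before being passed through the backbone. The sketch below illustrates this with toy numbers; the linear-beta schedule, array sizes, and the timestep choice are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x0, t, alpha_bar):
    """DDPM forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# Toy linear-beta schedule (1000 steps); real models define their own schedule.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

x0 = rng.standard_normal((8, 8))  # stand-in for an image / latent
# t = 261 is the timestep popularized by DIFT for Stable Diffusion features;
# the noised input x_noisy, not x0, is what the backbone would see.
x_noisy = add_noise(x0, t=261, alpha_bar=alpha_bar)
```

Ensemble-based methods average features over several independent draws of eps at this step, which multiplies the number of forward passes; the fine-tuning approach described above avoids the noising step entirely.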