Deep ViT Features as Dense Visual Descriptors
Shir Amir, Yossi Gandelsman, Shai Bagon, Tali Dekel
Abstract
We study the use of deep features extracted from a pretrained Vision Transformer (ViT) as dense visual descriptors. We observe and empirically demonstrate that such features, when extracted from a self-supervised ViT model (DINO-ViT), exhibit several striking properties, including: (i) the features encode powerful, well-localized semantic information at high spatial granularity, such as object parts; (ii) the encoded semantic information is shared across related, yet different, object categories; and (iii) positional bias changes gradually throughout the layers. These properties allow us to design simple methods for a variety of applications, including co-segmentation, part co-segmentation, and semantic correspondences. To distill the power of ViT features from convoluted design choices, we restrict ourselves to lightweight zero-shot methodologies (e.g., binning and clustering) applied directly to the features. Since our methods require no additional training nor data, they are readily applicable across a variety of domains. We show by extensive qualitative and quantitative evaluation that our simple methodologies achieve competitive results with recent state-of-the-art supervised methods, and outperform previous unsupervised methods by a large margin. Code is available at dino-vit-features.github.io.
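As a rough illustration of the zero-shot pipeline described above (dense DINO-ViT features followed by simple clustering), the sketch below extracts per-patch features from a DINO ViT-S/8 backbone and clusters them with k-means. This is not the authors' released code; the model choice, input size, layer selection, and cluster count are illustrative assumptions.

```python
# Minimal sketch (assumptions noted above): extract dense DINO-ViT patch
# features for one image and cluster them into coarse "part" segments.
import torch
from PIL import Image
from torchvision import transforms
from sklearn.cluster import KMeans

# Load a self-supervised DINO ViT-S/8 backbone from torch hub.
model = torch.hub.load("facebookresearch/dino:main", "dino_vits8").eval()

# Preprocess the image to a resolution divisible by the patch size (8).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
img = preprocess(Image.open("image.jpg").convert("RGB")).unsqueeze(0)

# get_intermediate_layers returns per-token features from the last block;
# drop the [CLS] token to keep only the patch tokens (a 28x28 grid here).
with torch.no_grad():
    feats = model.get_intermediate_layers(img, n=1)[0][:, 1:, :]  # (1, 784, 384)

# Cluster the patch descriptors; each cluster is a rough semantic segment.
labels = KMeans(n_clusters=5, n_init=10).fit_predict(feats[0].numpy())
segments = labels.reshape(28, 28)
print(segments)
```

In practice, which layer and which facet of the transformer features are used (e.g., keys versus token outputs) strongly affects localization and semantic sharing; the sketch above simply takes the final block's normalized tokens for brevity.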