Let Go of Your Labels with Unsupervised Transfer

Foundation vision-language models have enabled remarkable zero-shot transferability of the pre-trained representations to a wide range of downstream tasks. However, to solve a new task, zero-shot transfer still necessitates human guidance to define the visual categories that appear in the data. Here, we show that fully unsupervised transfer emerges when searching for the labeling of a dataset that induces maximal margin classifiers in the representation spaces of different foundation models. We present TURTLE, a fully unsupervised method that effectively employs this guiding principle to uncover the underlying labeling of a downstream dataset without any supervision or task-specific representation learning. We evaluate TURTLE on a diverse benchmark suite of 26 datasets and show that it achieves new state-of-the-art unsupervised performance. Furthermore, TURTLE, despite being fully unsupervised, outperforms zero-shot transfer baselines on a wide range of datasets. In particular, TURTLE matches the average performance of CLIP zero-shot on the 26 datasets by employing the same representation space, spanning a wide range of architectures and model sizes. By guiding the search for the underlying labeling with the representation spaces of two foundation models, TURTLE surpasses zero-shot transfer and unsupervised prompt tuning baselines, demonstrating the surprising power and effectiveness of unsupervised transfer.
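
The guiding principle can be illustrated with a minimal sketch (not the authors' released implementation): given frozen features from two foundation models, a labeling of the dataset is parameterized and optimized so that linear classifiers fitted on it in each representation space reproduce it confidently, a proxy for large-margin separability. The function name `turtle_style_search`, the alternating-gradient approximation of the bilevel search, the entropy regularizer, and all hyperparameters below are illustrative assumptions.

```python
# Minimal sketch under the assumptions stated above; not TURTLE's code.
import torch
import torch.nn.functional as F

def turtle_style_search(feats_a, feats_b, num_classes, steps=200, inner_steps=10, lr=1e-2):
    """Search for a labeling of unlabeled data that induces well-separated
    linear classifiers in two frozen representation spaces.

    feats_a: (N, d_a) features from foundation model A (e.g. CLIP), frozen.
    feats_b: (N, d_b) features from foundation model B (e.g. DINO), frozen.
    """
    # Task encoder: a soft labeling of the dataset, parameterized on space A.
    labeler = torch.nn.Linear(feats_a.shape[1], num_classes)
    outer_opt = torch.optim.Adam(labeler.parameters(), lr=lr)

    for _ in range(steps):
        soft_labels = F.softmax(labeler(feats_a), dim=1)  # current labeling guess
        loss = 0.0
        for feats in (feats_a, feats_b):
            # Inner problem: fit a linear classifier on the current labeling.
            clf = torch.nn.Linear(feats.shape[1], num_classes)
            inner_opt = torch.optim.SGD(clf.parameters(), lr=0.1)
            for _ in range(inner_steps):
                inner_loss = F.cross_entropy(clf(feats), soft_labels.detach())
                inner_opt.zero_grad()
                inner_loss.backward()
                inner_opt.step()
            # Outer objective: the labeling is good if the fitted classifier
            # reproduces it confidently (a proxy for large-margin separability).
            log_probs = F.log_softmax(clf(feats).detach(), dim=1)
            loss = loss - (soft_labels * log_probs).sum(dim=1).mean()
        # Regularizer (assumed): encourage balanced cluster usage to avoid the
        # trivial labeling that assigns every example to a single class.
        marginal = soft_labels.mean(dim=0)
        loss = loss + (marginal * marginal.clamp_min(1e-8).log()).sum()

        outer_opt.zero_grad()
        loss.backward()
        outer_opt.step()

    return labeler(feats_a).argmax(dim=1)  # hard cluster assignments

if __name__ == "__main__":
    # Random features stand in for precomputed embeddings of two foundation models.
    feats_a, feats_b = torch.randn(500, 512), torch.randn(500, 768)
    print(turtle_style_search(feats_a, feats_b, num_classes=10)[:20])
```

In this sketch the outer step updates only the labeling while the inner classifiers are refit from scratch each iteration; using the second representation space in the outer objective is what couples the search to both foundation models.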