Image Reconstruction as a Tool for Feature Analysis

Vision encoders are increasingly used in modern applications, from vision-only models to multimodal systems such as vision-language models. Despite their remarkable success, it remains unclear how these architectures represent features internally. Here, we propose a novel approach for interpreting vision features via image reconstruction. We compare two related model families, SigLIP and SigLIP2, which differ only in their training objective, and show that encoders pre-trained on image-based tasks retain significantly more image information than those trained on non-image tasks such as contrastive learning. We further apply our method to a range of vision encoders, ranking them by the informativeness of their feature representations. Finally, we demonstrate that manipulating the feature space yields predictable changes in reconstructed images, revealing that orthogonal rotations (rather than spatial transformations) control color encoding. Our approach can be applied to any vision encoder, shedding light on the inner structure of its feature space. The code and model weights to reproduce the experiments are available on GitHub.
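
As a rough illustration of the reconstruction-based probe described above, the sketch below trains a lightweight decoder on top of a frozen vision encoder to map patch features back to pixels, and includes a helper that applies a random orthogonal rotation to the feature space. The PatchDecoder module, the assumed (B, N, feat_dim) feature layout, and all sizes are illustrative assumptions for exposition, not the paper's actual architecture or training setup.

```python
import torch
import torch.nn as nn

# Sketch of a reconstruction probe: keep the encoder frozen and fit a small
# decoder that maps per-patch features back to RGB pixels. Any encoder that
# returns patch features of shape (B, N, feat_dim) can be plugged in here.

class PatchDecoder(nn.Module):
    def __init__(self, feat_dim=768, patch=16, img_size=224):
        super().__init__()
        self.patch = patch
        self.img_size = img_size
        # Linearly map each patch feature to a patch of RGB pixels.
        self.proj = nn.Linear(feat_dim, 3 * patch * patch)

    def forward(self, feats):
        # feats: (B, N, feat_dim) with N = (img_size // patch) ** 2
        b, n, _ = feats.shape
        grid = self.img_size // self.patch
        pixels = self.proj(feats)                      # (B, N, 3 * p * p)
        pixels = pixels.view(b, grid, grid, 3, self.patch, self.patch)
        pixels = pixels.permute(0, 3, 1, 4, 2, 5)      # (B, 3, grid, p, grid, p)
        return pixels.reshape(b, 3, self.img_size, self.img_size)

def reconstruction_step(encoder, decoder, images, optimizer):
    """One training step: the encoder stays frozen, the decoder is fit with an L2 loss."""
    with torch.no_grad():
        feats = encoder(images)                        # assumed shape (B, N, feat_dim)
    recon = decoder(feats)
    loss = nn.functional.mse_loss(recon, images)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def rotate_features(feats, seed=0):
    """Apply a random orthogonal rotation along the feature dimension,
    a simple way to probe how such manipulations change the reconstruction."""
    d = feats.shape[-1]
    g = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(d, d, generator=g))
    return feats @ q.to(feats)
```

Under these assumptions, once the decoder has converged, decoding rotated features (e.g. `decoder(rotate_features(feats))`) and comparing the output to the unrotated reconstruction gives a qualitative view of what the manipulated directions encode.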