ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights
Weixian Lei^{1,2,3}, Yixiao Ge^{2,†}, Jianfeng Zhang^{3}, Dylan Sun^{2}, Kun Yi^{2}, Ying Shan^{2}, Mike Zheng Shou^{1,3,†}
Abstract
Despite the success of CLIP-based training recipes for vision-language models, their scalability to more modalities (e.g., 3D, audio) is limited by the need for large-scale data, which is expensive or even unobtainable for rare modalities. In this paper, we present ViT-Lens, which facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. Specifically, a modality-specific lens is tuned to project multimodal signals into the shared embedding space; they are then processed by a strong ViT that carries pretrained image knowledge. The encoded multimodal representations are optimized to align with a modality-independent space pre-defined by off-the-shelf foundation models. A well-trained lens with a ViT backbone can itself serve as one of these foundation models, supervising the learning of subsequent modalities. ViT-Lens provides a unified solution for representation learning across a growing set of modalities, with two appealing benefits: (i) it exploits the pretrained ViT effectively across tasks and domains in a data-efficient regime; (ii) novel modalities exhibit emergent downstream capabilities thanks to the shared alignment space. We evaluate ViT-Lens on 3D as an initial verification. In zero-shot 3D classification, ViT-Lens achieves substantial improvements over the previous state of the art, reaching 52.0% accuracy on Objaverse-LVIS, 87.4% on ModelNet40, and 60.6% on ScanObjectNN. Furthermore, we enable zero-shot 3D question answering by simply integrating the trained 3D lens into the InstructBLIP model without any adaptation. We will release results of ViT-Lens on more modalities in the near future.
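To make the pipeline concrete, below is a minimal PyTorch sketch of the architecture described above, assuming a point cloud as the novel modality. All module names, sizes, the cross-attention lens design, and the InfoNCE-style alignment loss are illustrative assumptions for exposition, not the authors' released implementation; a small transformer encoder stands in for the pretrained ViT.

# Minimal sketch of the ViT-Lens idea, assuming a point-cloud "novel modality".
# Module names, sizes, and the alignment loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointCloudLens(nn.Module):
    """Modality-specific lens: maps raw 3D points to ViT-compatible tokens."""
    def __init__(self, num_tokens=64, dim=768):
        super().__init__()
        self.point_proj = nn.Linear(3, dim)           # embed each (x, y, z) point
        self.queries = nn.Parameter(torch.randn(num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, points):                        # points: (B, N, 3)
        feats = self.point_proj(points)               # (B, N, dim)
        q = self.queries.expand(points.size(0), -1, -1)
        tokens, _ = self.attn(q, feats, feats)        # cross-attend into fixed tokens
        return tokens                                 # (B, num_tokens, dim)

class ViTLens(nn.Module):
    def __init__(self, vit: nn.Module, dim=768, embed_dim=512):
        super().__init__()
        self.lens = PointCloudLens(dim=dim)           # trainable
        self.vit = vit                                # pretrained ViT blocks, frozen
        for p in self.vit.parameters():
            p.requires_grad = False
        self.head = nn.Linear(dim, embed_dim)         # project to the shared space

    def forward(self, points):
        tokens = self.lens(points)
        feats = self.vit(tokens)                      # reuse pretrained image knowledge
        return F.normalize(self.head(feats.mean(dim=1)), dim=-1)

def alignment_loss(pred, anchor, temperature=0.07):
    """InfoNCE-style loss aligning 3D embeddings with frozen anchor embeddings
    (e.g., CLIP text/image features of the same objects)."""
    logits = pred @ anchor.t() / temperature
    targets = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, targets)

# Toy usage: a stand-in transformer plays the role of the pretrained ViT.
vit = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)
model = ViTLens(vit)
points = torch.randn(4, 1024, 3)                      # batch of 4 point clouds
anchor = F.normalize(torch.randn(4, 512), dim=-1)     # frozen anchor-space targets
loss = alignment_loss(model(points), anchor)
loss.backward()                                       # only the lens and head update

Note that only the lens and the projection head receive gradients; the ViT stays frozen, which is what keeps the data requirements of each new modality modest.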