
ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights

Lei, Weixian ; Ge, Yixiao ; Zhang, Jianfeng ; Sun, Dylan ; Yi, Kun ; Shan, Ying ; Shou, Mike Zheng
Abstract

Despite the success of CLIP-based training recipes in vision-language models, their scalability to more modalities (e.g., 3D, audio, etc.) is limited by the need for large-scale data, which is expensive or even unavailable for rare modalities. In this paper, we present ViT-Lens, which facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. Specifically, a modality-specific lens is tuned to project multimodal signals into a shared embedding space, where they are then processed by a strong ViT that carries pre-trained image knowledge. The encoded multimodal representations are optimized toward aligning with the modal-independent space, pre-defined by off-the-shelf foundation models. A well-trained lens with a ViT backbone has the potential to serve as one of these foundation models, supervising the learning of subsequent modalities. ViT-Lens provides a unified solution for representation learning of an increasing number of modalities with two appealing benefits: (i) it exploits the pretrained ViT across tasks and domains effectively in a data-efficient regime; (ii) emergent downstream capabilities for novel modalities are demonstrated thanks to the modality alignment space. We evaluate ViT-Lens in the context of 3D as an initial verification. In zero-shot 3D classification, ViT-Lens achieves substantial improvements over the previous state of the art, showing 52.0% accuracy on Objaverse-LVIS, 87.4% on ModelNet40, and 60.6% on ScanObjectNN. Furthermore, we enable zero-shot 3D question-answering by simply integrating the trained 3D lens into the InstructBLIP model without any adaptation. We will release the results of ViT-Lens on more modalities in the near future.
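
To make the zero-shot classification step concrete, the following is a minimal sketch of how a query embedding can be matched against class-name embeddings once both live in a shared, aligned space. It is not the authors' implementation: the function names, the three-dimensional toy vectors, and the class labels are all assumptions chosen for illustration. In practice the query would come from the lens plus ViT, and the anchors from a text encoder of a foundation model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def zero_shot_classify(query_embedding, class_embeddings):
    """Return the label whose anchor embedding is closest to the query.

    class_embeddings maps a label to its anchor embedding in the shared space.
    """
    scores = {
        label: cosine_similarity(query_embedding, emb)
        for label, emb in class_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical anchors standing in for text embeddings of class names.
text_anchors = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "lamp": [0.0, 0.0, 1.0],
}

# Hypothetical embedding standing in for an encoded 3D point cloud.
point_cloud_embedding = [0.9, 0.2, 0.1]

label, scores = zero_shot_classify(point_cloud_embedding, text_anchors)
print(label)  # prints "chair"
```

No class-specific training is involved at this step: classification reduces to a nearest-anchor lookup, which is why alignment to a pre-defined space is what enables zero-shot use of a new modality.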
