ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights
Weixian Lei^{1,2,3}, Yixiao Ge^{2,†}, Jianfeng Zhang^{3}, Dylan Sun^{2}, Kun Yi^{2}, Ying Shan^{2}, Mike Zheng Shou^{1,3,†}
Abstract
Despite the success of CLIP-based training recipes for vision-language models, their scalability to more modalities (e.g., 3D, audio) is limited by the need for large-scale data, which is expensive or even unobtainable for rare modalities. In this paper, we present ViT-Lens, which facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. Specifically, a modality-specific lens is tuned to project multimodal signals into the shared embedding space; they are then processed by a strong ViT that carries pretrained image knowledge. The encoded multimodal representations are optimized to align with a modality-independent space pre-defined by off-the-shelf foundation models. A well-trained lens with a ViT backbone can itself serve as one of these foundation models, supervising the learning of subsequent modalities. ViT-Lens provides a unified solution for representation learning across a growing set of modalities, with two appealing benefits: (i) it exploits the pretrained ViT effectively across tasks and domains in a data-efficient regime; (ii) novel modalities exhibit emergent downstream capabilities thanks to the shared alignment space. We evaluate ViT-Lens on 3D as an initial verification. In zero-shot 3D classification, ViT-Lens achieves substantial improvements over the previous state of the art, reaching 52.0% accuracy on Objaverse-LVIS, 87.4% on ModelNet40, and 60.6% on ScanObjectNN. Furthermore, we enable zero-shot 3D question answering by simply integrating the trained 3D lens into the InstructBLIP model without any adaptation. We will release results of ViT-Lens on more modalities in the near future.
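To make the pipeline concrete, below is a minimal PyTorch sketch of the architecture described above, assuming a point cloud as the novel modality. All module names, sizes, the cross-attention lens design, and the InfoNCE-style alignment loss are illustrative assumptions for exposition, not the authors' released implementation; a small transformer encoder stands in for the pretrained ViT.

# Minimal sketch of the ViT-Lens idea, assuming a point-cloud "novel modality".
# Module names, sizes, and the alignment loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointCloudLens(nn.Module):
    """Modality-specific lens: maps raw 3D points to ViT-compatible tokens."""
    def __init__(self, num_tokens=64, dim=768):
        super().__init__()
        self.point_proj = nn.Linear(3, dim)           # embed each (x, y, z) point
        self.queries = nn.Parameter(torch.randn(num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, points):                        # points: (B, N, 3)
        feats = self.point_proj(points)               # (B, N, dim)
        q = self.queries.expand(points.size(0), -1, -1)
        tokens, _ = self.attn(q, feats, feats)        # cross-attend into fixed tokens
        return tokens                                 # (B, num_tokens, dim)

class ViTLens(nn.Module):
    def __init__(self, vit: nn.Module, dim=768, embed_dim=512):
        super().__init__()
        self.lens = PointCloudLens(dim=dim)           # trainable
        self.vit = vit                                # pretrained ViT blocks, frozen
        for p in self.vit.parameters():
            p.requires_grad = False
        self.head = nn.Linear(dim, embed_dim)         # project to the shared space

    def forward(self, points):
        tokens = self.lens(points)
        feats = self.vit(tokens)                      # reuse pretrained image knowledge
        return F.normalize(self.head(feats.mean(dim=1)), dim=-1)

def alignment_loss(pred, anchor, temperature=0.07):
    """InfoNCE-style loss aligning 3D embeddings with frozen anchor embeddings
    (e.g., CLIP text/image features of the same objects)."""
    logits = pred @ anchor.t() / temperature
    targets = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, targets)

# Toy usage: a stand-in transformer plays the role of the pretrained ViT.
vit = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)
model = ViTLens(vit)
points = torch.randn(4, 1024, 3)                      # batch of 4 point clouds
anchor = F.normalize(torch.randn(4, 512), dim=-1)     # frozen anchor-space targets
loss = alignment_loss(model(points), anchor)
loss.backward()                                       # only the lens and head update

Note that only the lens and the projection head receive gradients; the ViT stays frozen, which is what keeps the data requirements of each new modality modest.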