Multimodal Autoregressive Pre-training of Large Vision Encoders

Abstract

We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
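The core recipe described above pairs a vision encoder with a causal multimodal decoder that autoregressively predicts the combined sequence of raw image patches (via regression) and caption tokens (via next-token classification). The sketch below is a minimal PyTorch illustration of that objective; all module names, dimensions, and the equal loss weighting are our own assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AIMv2Sketch(nn.Module):
    """Illustrative sketch of AIMV2-style multimodal autoregressive
    pre-training. All names and sizes are assumptions, not the paper's code."""

    def __init__(self, dim=512, patch_dim=768, vocab=32000,
                 heads=8, enc_layers=6, dec_layers=6):
        super().__init__()
        # Vision encoder over raw image patches.
        self.patch_embed = nn.Linear(patch_dim, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True),
            enc_layers)
        # Multimodal decoder: causal attention over [patch states; text tokens].
        self.text_embed = nn.Embedding(vocab, dim)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True),
            dec_layers)
        # Two heads: regress the next raw patch, classify the next text token.
        self.patch_head = nn.Linear(dim, patch_dim)
        self.text_head = nn.Linear(dim, vocab)

    def forward(self, patches, text_ids):
        # patches: (B, P, patch_dim) flattened pixel patches
        # text_ids: (B, T) caption token ids
        P, T = patches.shape[1], text_ids.shape[1]
        enc = self.encoder(self.patch_embed(patches))             # (B, P, dim)
        seq = torch.cat([enc, self.text_embed(text_ids)], dim=1)  # (B, P+T, dim)
        mask = nn.Transformer.generate_square_subsequent_mask(P + T)
        dec = self.decoder(seq, mask=mask)
        # Position i predicts element i+1 of the combined sequence.
        patch_loss = F.mse_loss(self.patch_head(dec[:, :P - 1]),
                                patches[:, 1:])
        text_logits = self.text_head(dec[:, P - 1:P + T - 1])     # (B, T, vocab)
        text_loss = F.cross_entropy(
            text_logits.reshape(-1, text_logits.size(-1)),
            text_ids.reshape(-1))
        return patch_loss + text_loss  # equal weighting is an assumption

model = AIMv2Sketch()
loss = model(torch.randn(2, 16, 768), torch.randint(0, 32000, (2, 12)))
loss.backward()
```

Putting the encoded patches as a causal prefix before the text tokens lets a single decoder pass supervise both modalities, which is what makes the pre-training process straightforward and scalable.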

