Multimodal Autoregressive Pre-training of Large Vision Encoders

We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
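
To make the pre-training setup concrete, the sketch below illustrates the general idea of pairing a vision encoder with a causal multimodal decoder that predicts raw image patches (pixel regression) and next text tokens (cross-entropy). This is a minimal, assumption-laden sketch: the module names (`VisionEncoder`, `MultimodalDecoder`, `pretrain_loss`), all hyperparameters, the equal loss weighting, and the exact sequence layout are illustrative choices, not the paper's actual architecture or configuration.

```python
# Minimal sketch of an AIMV2-style multimodal autoregressive objective.
# All sizes, names, and loss weights are illustrative assumptions.
import torch
import torch.nn as nn


class VisionEncoder(nn.Module):
    """Simplified ViT-style encoder: patchify an image, then apply self-attention blocks."""

    def __init__(self, image_size=224, patch_size=14, dim=512, depth=4, heads=8):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, images):                                    # (B, 3, H, W)
        x = self.patch_embed(images).flatten(2).transpose(1, 2)   # (B, N, dim)
        return self.blocks(x + self.pos_embed)                    # (B, N, dim)


class MultimodalDecoder(nn.Module):
    """Causal decoder over [image features, text tokens]; regresses raw patches and predicts text."""

    def __init__(self, dim=512, depth=4, heads=8, patch_dim=3 * 14 * 14, vocab_size=32000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.patch_head = nn.Linear(dim, patch_dim)   # regression head: raw pixel patches
        self.text_head = nn.Linear(dim, vocab_size)   # classification head: next text token

    def forward(self, patch_feats, text_ids):
        # Image features first, then text embeddings, under a causal attention mask.
        seq = torch.cat([patch_feats, self.text_embed(text_ids)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1)).to(seq.device)
        h = self.blocks(seq, mask=mask)
        n_img = patch_feats.size(1)
        return self.patch_head(h[:, :n_img]), self.text_head(h[:, n_img:])


def pretrain_loss(encoder, decoder, images, text_ids, patch_size=14):
    """Next-patch pixel regression on the image part plus next-token loss on the text part."""
    feats = encoder(images)
    patch_pred, text_logits = decoder(feats, text_ids[:, :-1])
    # Ground-truth raw patches, flattened to match the regression head: (B, N, 3*p*p).
    target = nn.functional.unfold(images, patch_size, stride=patch_size).transpose(1, 2)
    img_loss = nn.functional.mse_loss(patch_pred[:, :-1], target[:, 1:])  # predict the *next* patch
    txt_loss = nn.functional.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)), text_ids[:, 1:].reshape(-1)
    )
    return img_loss + txt_loss  # equal weighting is an assumption


# Usage example on random data.
encoder, decoder = VisionEncoder(), MultimodalDecoder()
images = torch.randn(2, 3, 224, 224)
text_ids = torch.randint(0, 32000, (2, 16))
loss = pretrain_loss(encoder, decoder, images, text_ids)
loss.backward()
```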