Multimodal Autoregressive Pre-training of Large Vision Encoders

We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
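
To make the pre-training setup concrete, the sketch below illustrates the general idea of pairing a vision encoder with a causal multimodal decoder that predicts raw image patches (pixel regression) and next text tokens (cross-entropy). This is a minimal, assumption-laden sketch: the module names (`VisionEncoder`, `MultimodalDecoder`, `pretrain_loss`), all hyperparameters, the equal loss weighting, and the exact sequence layout are illustrative choices, not the paper's actual architecture or configuration.

```python
# Minimal sketch of an AIMV2-style multimodal autoregressive objective.
# All sizes, names, and loss weights are illustrative assumptions.
import torch
import torch.nn as nn


class VisionEncoder(nn.Module):
    """Simplified ViT-style encoder: patchify an image, then apply self-attention blocks."""

    def __init__(self, image_size=224, patch_size=14, dim=512, depth=4, heads=8):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, images):                                    # (B, 3, H, W)
        x = self.patch_embed(images).flatten(2).transpose(1, 2)   # (B, N, dim)
        return self.blocks(x + self.pos_embed)                    # (B, N, dim)


class MultimodalDecoder(nn.Module):
    """Causal decoder over [image features, text tokens]; regresses raw patches and predicts text."""

    def __init__(self, dim=512, depth=4, heads=8, patch_dim=3 * 14 * 14, vocab_size=32000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.patch_head = nn.Linear(dim, patch_dim)   # regression head: raw pixel patches
        self.text_head = nn.Linear(dim, vocab_size)   # classification head: next text token

    def forward(self, patch_feats, text_ids):
        # Image features first, then text embeddings, under a causal attention mask.
        seq = torch.cat([patch_feats, self.text_embed(text_ids)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1)).to(seq.device)
        h = self.blocks(seq, mask=mask)
        n_img = patch_feats.size(1)
        return self.patch_head(h[:, :n_img]), self.text_head(h[:, n_img:])


def pretrain_loss(encoder, decoder, images, text_ids, patch_size=14):
    """Next-patch pixel regression on the image part plus next-token loss on the text part."""
    feats = encoder(images)
    patch_pred, text_logits = decoder(feats, text_ids[:, :-1])
    # Ground-truth raw patches, flattened to match the regression head: (B, N, 3*p*p).
    target = nn.functional.unfold(images, patch_size, stride=patch_size).transpose(1, 2)
    img_loss = nn.functional.mse_loss(patch_pred[:, :-1], target[:, 1:])  # predict the *next* patch
    txt_loss = nn.functional.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)), text_ids[:, 1:].reshape(-1)
    )
    return img_loss + txt_loss  # equal weighting is an assumption


# Usage example on random data.
encoder, decoder = VisionEncoder(), MultimodalDecoder()
images = torch.randn(2, 3, 224, 224)
text_ids = torch.randint(0, 32000, (2, 16))
loss = pretrain_loss(encoder, decoder, images, text_ids)
loss.backward()
```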