Aya Vision: Advancing the Frontier of Multilingual Multimodality

Building multimodal language models is fundamentally challenging: it requires aligning vision and language modalities, curating high-quality instruction data, and avoiding the degradation of existing text-only capabilities once vision is introduced. These difficulties are further magnified in the multilingual setting, where the need for multimodal data in different languages exacerbates existing data scarcity, machine translation often distorts meaning, and catastrophic forgetting is more pronounced. To address these challenges, we introduce novel techniques spanning both data and modeling. First, we develop a synthetic annotation framework that curates high-quality, diverse multilingual multimodal instruction data, enabling Aya Vision models to produce natural, human-preferred responses to multimodal inputs across many languages. Complementing this, we propose a cross-modal model merging technique that mitigates catastrophic forgetting, effectively preserving text-only capabilities while simultaneously enhancing multimodal generative performance. Aya-Vision-8B achieves best-in-class performance compared to strong multimodal models such as Qwen-2.5-VL-7B, Pixtral-12B, and even the much larger Llama-3.2-90B-Vision. We further scale this approach with Aya-Vision-32B, which outperforms models more than twice its size, such as Molmo-72B and Llama-3.2-90B-Vision. Our work advances multilingual progress on the multimodal frontier, and provides insights into techniques that effectively bend the need for compute while delivering extremely high performance.
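To make the cross-modal model merging idea concrete, the sketch below shows one common way such merging is done: linear interpolation of the shared language-model weights between the original text-only checkpoint and the vision-tuned model. The abstract does not specify Aya Vision's exact merging formulation, so the function name, the `alpha` coefficient, and the handling of vision-only parameters here are illustrative assumptions rather than the paper's method.

```python
import torch


def merge_state_dicts(text_only_sd, multimodal_sd, alpha=0.5):
    """Linearly interpolate shared language-model weights (illustrative sketch).

    alpha = 1.0 keeps the multimodal fine-tune unchanged; alpha = 0.0 reverts
    the shared weights to the original text-only checkpoint. Parameters that
    exist only in the multimodal model (e.g. the vision encoder or the
    vision-language connector) have no text-only counterpart and are kept as-is.
    """
    merged = {}
    for name, mm_param in multimodal_sd.items():
        if name in text_only_sd and text_only_sd[name].shape == mm_param.shape:
            # Shared decoder weights: blend multimodal and text-only versions.
            merged[name] = alpha * mm_param + (1.0 - alpha) * text_only_sd[name]
        else:
            # Vision-specific parameters: copy directly from the multimodal model.
            merged[name] = mm_param
    return merged


# Example usage (model variables are placeholders):
# merged = merge_state_dicts(text_llm.state_dict(), vlm.state_dict(), alpha=0.6)
# vlm.load_state_dict(merged)
```

The intuition is that pulling the shared decoder weights back toward the text-only checkpoint recovers text capabilities lost during multimodal fine-tuning, while keeping the vision components intact preserves multimodal skill; the blending coefficient trades off between the two.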