
Abstract

The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (-) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (-) How can research in native VLMs be made more accessible and democratized, thereby accelerating progress in the field? In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, a native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.
