16 days ago

From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

Haiwen Diao Mingxuan Li Silei Wu Linjun Dai Xiaohua Wang Hanming Deng Lewei Lu Dahua Lin Ziwei Liu

Abstract

The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (-) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (-) How can research in native VLMs be made more accessible and democratized, thereby accelerating progress in the field? In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.
