NVILA: Efficient Frontier Visual Language Models

Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu
Abstract

Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 4.5X, fine-tuning memory usage by 3.4X, pre-filling latency by 1.6-2.2X, and decoding latency by 1.2-2.8X. We will soon make our code and models available to facilitate reproducibility.
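To make the "scale-then-compress" idea concrete, here is a minimal sketch of one way it can work: a higher-resolution input yields a larger grid of visual tokens, which is then compressed (here, by average pooling over the token grid) before being handed to the language model. The function name, shapes, and the choice of pooling are illustrative assumptions for this sketch, not NVILA's actual implementation.

```python
import torch
import torch.nn.functional as F

def scale_then_compress(image_tokens: torch.Tensor, pool: int = 2) -> torch.Tensor:
    """Illustrative sketch (not NVILA's actual design): keep the larger token
    grid produced by a higher-resolution input, then compress it with spatial
    pooling so the language model sees far fewer visual tokens.

    image_tokens: (batch, height, width, channels) grid of visual tokens.
    Returns a flattened (batch, tokens, channels) sequence.
    """
    # (b, h, w, c) -> (b, c, h, w) so we can pool over the 2D token grid.
    x = image_tokens.permute(0, 3, 1, 2)
    # Average each pool x pool patch of tokens into one token,
    # reducing the token count by pool**2 (e.g. 4x for pool=2).
    x = F.avg_pool2d(x, kernel_size=pool)
    # Flatten back to a (batch, tokens, channels) sequence for the LLM.
    return x.flatten(2).transpose(1, 2)

# Example: a 448x448 input with 14x14 patches gives a 32x32 token grid;
# pooling with pool=2 compresses 1024 tokens down to 256.
tokens = torch.randn(1, 32, 32, 1024)
print(scale_then_compress(tokens).shape)  # torch.Size([1, 256, 1024])
```

The intuition is that scaling up resolution first preserves fine spatial and temporal detail, while the subsequent compression step keeps the token budget (and thus training and inference cost) manageable.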