NVILA: Efficient Frontier Visual Language Models

Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 4.5X, fine-tuning memory usage by 3.4X, pre-filling latency by 1.6-2.2X, and decoding latency by 1.2-2.8X. We will soon make our code and models available to facilitate reproducibility.