4 months ago

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao

View Paper Details View Code

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video
Understanding

Abstract

In this paper, we propose VideoLLaMA3, a more advanced multimodal foundationmodel for image and video understanding. The core design philosophy ofVideoLLaMA3 is vision-centric. The meaning of "vision-centric" is two-fold: thevision-centric training paradigm and vision-centric framework design. The keyinsight of our vision-centric training paradigm is that high-quality image-textdata is crucial for both image and video understanding. Instead of preparingmassive video-text datasets, we focus on constructing large-scale andhigh-quality image-text datasets. VideoLLaMA3 has four training stages: 1)vision-centric alignment stage, which warms up the vision encoder andprojector; 2) vision-language pretraining stage, which jointly tunes the visionencoder, projector, and LLM with large-scale image-text data covering multipletypes (including scene images, documents, charts) as well as text-only data. 3)multi-task fine-tuning stage, which incorporates image-text SFT data fordownstream tasks and video-text data to establish a foundation for videounderstanding. 4) video-centric fine-tuning, which further improves the model'scapability in video understanding. As for the framework design, to bettercapture fine-grained details in images, the pretrained vision encoder isadapted to encode images of varying sizes into vision tokens with correspondingnumbers, rather than a fixed number of tokens. For video inputs, we reduce thenumber of vision tokens according to their similarity so that therepresentation of videos will be more precise and compact. Benefit fromvision-centric designs, VideoLLaMA3 achieves compelling performances in bothimage and video understanding benchmarks.