LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Long-context capability is critical for multi-modal foundation models. We introduce LongVILA, a full-stack solution for long-context vision-language models, including system, model training, and dataset development. On the system side, we introduce the first Multi-Modal Sequence Parallelism (MM-SP) system that enables long-context training and inference, supporting 2M context-length training on 256 GPUs. MM-SP is also efficient, running 2.1x - 5.7x faster than Ring-Style Sequence Parallelism and 1.1x - 1.4x faster than Megatron-LM in text-only settings. Moreover, it integrates seamlessly with Hugging Face Transformers. For model training, we propose a five-stage pipeline comprising alignment, pre-training, context extension, and long-short joint supervised fine-tuning. Regarding datasets, we meticulously construct large-scale visual language pre-training datasets and long video instruction-following datasets to support our multi-stage training process. The full-stack solution extends the feasible frame count of VILA by a factor of 128 (from 8 to 1024 frames) and improves the long video captioning score from 2.00 to 3.26 (1.6x), achieving 99.5% accuracy on a 1400-frame (274k context length) needle-in-a-haystack task. LongVILA-8B also demonstrates consistent performance improvement on long videos in the VideoMME benchmark as the number of video frames increases.
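
To make the sequence-parallelism idea concrete, below is a minimal, illustrative Python sketch of splitting one long multi-modal token sequence into per-rank chunks so each GPU holds roughly seq_len / world_size tokens of activations. The helper name `shard_sequence` and the simple contiguous split are assumptions for illustration only; the actual MM-SP sharding strategy and its attention communication are described in the paper and are not reproduced here.

```python
# Minimal sketch (not the actual MM-SP implementation): shard a long
# multi-modal token sequence across sequence-parallel ranks.

from typing import List


def shard_sequence(token_ids: List[int], world_size: int) -> List[List[int]]:
    """Split one long token sequence into near-equal contiguous chunks,
    one per sequence-parallel rank."""
    n = len(token_ids)
    base, rem = divmod(n, world_size)
    shards, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < rem else 0)  # spread the remainder evenly
        shards.append(token_ids[start:start + size])
        start += size
    return shards


if __name__ == "__main__":
    # e.g. a 274k-token context (~1400 frames) split across 256 GPUs
    seq = list(range(274_000))
    shards = shard_sequence(seq, world_size=256)
    print(len(shards), min(map(len, shards)), max(map(len, shards)))
```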