HyperAI

Scaling RL to Long Videos

Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han
Abstract

We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In experiments, LongVILA-R1-7B achieves strong performance on long video QA benchmarks such as VideoMME. It also outperforms Video-R1-7B and even matches Gemini-1.5-Pro across temporal reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning on our LongVideo-Reason-eval benchmark. Notably, our MR-SP system achieves up to a 2.1x speedup on long video RL training. LongVILA-R1 demonstrates consistent performance gains as the number of input video frames scales, marking a firm step towards long video reasoning in VLMs. In addition, we publicly release our training system, which supports RL training on various modalities (video, text, and audio), various models (the VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames / around 256k tokens).
