
LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation

Jianxiong Gao, Zhaoxi Chen, Xian Liu, Jianfeng Feng, Chenyang Si, Yanwei Fu, Yu Qiao, Ziwei Liu
Abstract

Controllable ultra-long video generation is a fundamental yet challenging task. Although existing methods are effective for short clips, they struggle to scale due to issues such as temporal inconsistency and visual degradation. In this paper, we first investigate and identify three key factors: separate noise initialization, independent control signal normalization, and the limitations of single-modality guidance. To address these issues, we propose LongVie, an end-to-end autoregressive framework for controllable long video generation. LongVie introduces two core designs to ensure temporal consistency: 1) a unified noise initialization strategy that maintains consistent generation across clips, and 2) global control signal normalization that enforces alignment in the control space throughout the entire video. To mitigate visual degradation, LongVie employs 3) a multi-modal control framework that integrates both dense (e.g., depth maps) and sparse (e.g., keypoints) control signals, complemented by 4) a degradation-aware training strategy that adaptively balances modality contributions over time to preserve visual quality. We also introduce LongVGenBench, a comprehensive benchmark consisting of 100 high-resolution videos spanning diverse real-world and synthetic environments, each lasting over one minute. Extensive experiments show that LongVie achieves state-of-the-art performance in long-range controllability, consistency, and quality.
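
To make the two consistency designs concrete, here is a minimal NumPy sketch, not the authors' implementation: the function names (make_unified_noise, normalize_controls_globally), the fixed seed, and the min-max normalization scheme are illustrative assumptions. It shows one noise tensor drawn once and reused for every clip, and control signals normalized with statistics computed over the whole video rather than per clip.

```python
import numpy as np

def make_unified_noise(shape, seed=42):
    """Sample one noise tensor from a fixed seed and reuse it for every clip,
    so all autoregressively generated clips start from the same initialization.
    (Illustrative: the seed and shape are placeholders.)"""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape).astype(np.float32)

def normalize_controls_globally(control_frames):
    """Normalize control signals (e.g., depth maps) with extrema computed over
    the entire video rather than per clip, keeping the control space aligned
    across clip boundaries. (Illustrative min-max normalization.)"""
    frames = np.stack(control_frames)       # (T, H, W)
    lo, hi = frames.min(), frames.max()     # global, not per-clip, statistics
    return [(f - lo) / (hi - lo + 1e-8) for f in control_frames]
```

Under this sketch, every clip receives the identical make_unified_noise output, and a depth value maps to the same normalized value no matter which clip it falls in, removing the two clip-boundary discontinuities the paper identifies.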
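The degradation-aware strategy is described only at a high level in the abstract; the schedule below is a hypothetical illustration of adaptively balancing modality contributions over time, not the paper's trained mechanism. The exponential decay of the dense (depth) weight and the linear ramp of the sparse (keypoint) weight, along with the function and parameter names, are assumptions chosen for clarity.

```python
import numpy as np

def modality_weights(t, total_frames, w_dense0=1.0, w_sparse0=0.5, decay=2.0):
    """Hypothetical degradation-aware schedule: down-weight the dense branch
    (depth) as generation progresses so the sparse branch (keypoints)
    gradually takes over, rather than over-constraining late frames."""
    progress = t / max(total_frames - 1, 1)
    w_dense = w_dense0 * np.exp(-decay * progress)
    w_sparse = w_sparse0 + (1.0 - w_sparse0) * progress
    return w_dense, w_sparse

def fuse_controls(dense_feat, sparse_feat, t, total_frames):
    """Combine dense and sparse control features with time-varying weights."""
    w_d, w_s = modality_weights(t, total_frames)
    return w_d * dense_feat + w_s * sparse_feat
```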