
FreeLong++: Training-Free Long Video Generation via Multi-band Spectral Fusion

Yu Lu, Yi Yang
Abstract

Recent advances in video generation models have enabled high-quality short video generation from text prompts. However, extending these models to longer videos remains a significant challenge, primarily due to degraded temporal consistency and visual fidelity. Our preliminary observations show that naively applying short-video generation models to longer sequences leads to noticeable quality degradation. Further analysis identifies a systematic trend in which high-frequency components become increasingly distorted as video length grows, an issue we term high-frequency distortion. To address this, we propose FreeLong, a training-free framework designed to balance the frequency distribution of long-video features during the denoising process. FreeLong achieves this by blending global low-frequency features, which capture holistic semantics across the full video, with local high-frequency features extracted from short temporal windows, which preserve fine details. Building on this, FreeLong++ extends FreeLong's dual-branch design into a multi-branch architecture with multiple attention branches, each operating at a distinct temporal scale. By arranging window sizes from global to local, FreeLong++ enables multi-band frequency fusion from low to high frequencies, ensuring both semantic continuity and fine-grained motion dynamics across longer video sequences. Without any additional training, FreeLong++ can be plugged into existing video generation models (e.g., Wan2.1 and LTX-Video) to produce longer videos with substantially improved temporal consistency and visual fidelity. We demonstrate that our approach outperforms previous methods on longer video generation tasks (e.g., 4x and 8x the native length). It also supports coherent multi-prompt video generation with smooth scene transitions and enables controllable video generation using long depth or pose sequences.
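The multi-band fusion the abstract describes can be pictured as a frequency-domain composition: each attention branch, ordered from the most global to the most local temporal window, contributes one band of the temporal frequency spectrum, from low to high. The following is a minimal illustrative sketch of that idea, not the authors' implementation; the function name, the `(T, C)` feature layout, and the normalized cutoff scheme are all assumptions made for the example.

```python
import numpy as np

def multiband_spectral_fusion(branch_feats, cutoffs):
    """Illustrative multi-band fusion along the temporal axis.

    branch_feats: list of arrays of shape (T, C), one per attention
        branch, ordered global -> local. The global branch supplies the
        low-frequency band; progressively more local branches supply
        progressively higher bands.
    cutoffs: increasing normalized frequency cutoffs, one per branch;
        the last should be 0.5 (the Nyquist limit) so the bands cover
        the whole spectrum.
    """
    T = branch_feats[0].shape[0]
    freqs = np.abs(np.fft.fftfreq(T))                 # |frequency| per FFT bin
    fused_spec = np.zeros((T, branch_feats[0].shape[1]), dtype=complex)
    lo = -1.0                                         # so the first band includes DC
    for feat, hi in zip(branch_feats, cutoffs):
        spec = np.fft.fft(feat, axis=0)               # temporal FFT per channel
        band = (freqs > lo) & (freqs <= hi)           # this branch's frequency band
        fused_spec[band] += spec[band]                # copy only that band over
        lo = hi
    return np.fft.ifft(fused_spec, axis=0).real       # back to the time domain
```

Because the bands are disjoint and together cover the full spectrum, fusing a single branch with cutoff 0.5 returns that branch unchanged, and with several branches the overall semantics (the DC and low-frequency content) come from the global branch while fine temporal detail comes from the local ones.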