LongVie 2: A Multi-Modal Controllable Ultra-Long Video World Model
Jianxiong Gao, Zhaoxi Chen, Xian Liu, Junhao Zhuang, Chengming Xu, Jianfeng Feng, Yu Qiao, Yanwei Fu, Chenyang Si, Ziwei Liu
Abstract
Building video world models on top of pretrained video generation systems is an important yet challenging step toward general spatio-temporal intelligence. A world model should possess three essential properties: controllability, long-horizon visual quality, and temporal consistency. To this end, we take a progressive approach: we first improve controllability, then extend the model toward high-quality, long-horizon video generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) degradation-aware training on the input frame, which narrows the gap between training and long-horizon inference while preserving high visual quality; and (3) history-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark of 100 high-quality one-minute videos covering diverse real-world and synthetic environments. Extensive experiments show that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal consistency, and visual fidelity, and supports continuous video generation of up to five minutes, marking a substantial step toward unified video world modeling.
One-sentence Summary
The authors from FDU, NJU, NTU, NVIDIA, THU, and Shanghai AI Laboratory propose LongVie 2, an end-to-end autoregressive video world model that achieves controllable, ultra-long video generation up to five minutes by integrating multi-modal guidance, degradation-aware training, and history-context modeling—enabling superior long-term consistency and visual fidelity compared to prior work.
Key Contributions
- LongVie 2 addresses the challenge of building controllable, long-horizon video world models by extending pretrained diffusion backbones with a progressive three-stage training framework that enhances controllability, temporal consistency, and visual fidelity—key properties for realistic spatiotemporal modeling.
- The method introduces multi-modal guidance using both dense (e.g., depth maps) and sparse (e.g., keypoints) control signals to provide implicit world-level supervision, degradation-aware training to simulate long-term inference conditions and maintain visual quality, and history-context guidance to align adjacent clips and ensure long-range temporal coherence.
- LongVie 2 is evaluated on LongVGenBench, a new benchmark of 100 high-resolution one-minute videos spanning diverse real-world and synthetic environments, where it achieves state-of-the-art performance in controllability, temporal consistency, and visual fidelity, supporting continuous video generation up to five minutes.
Introduction
The authors leverage pretrained video diffusion models to build LongVie 2, a controllable ultra-long video world model capable of generating 3–5 minute videos with high visual fidelity and temporal consistency. This work addresses two key challenges in prior video world models: limited controllability—often restricted to low-level or localized inputs—and fragile long-term coherence, where visual quality degrades and temporal drift emerges over extended sequences. To overcome these, the authors introduce a three-stage training framework: first, they integrate multi-modal guidance using both dense (e.g., depth maps) and sparse (e.g., keypoints) control signals to enable global, semantic-level scene manipulation; second, they employ a degradation-aware training strategy that applies controlled distortions to early frames, bridging the gap between short training clips and long-horizon inference; third, they incorporate history context by feeding preceding frames as input, ensuring stable temporal evolution across generated segments. These innovations enable LongVie 2 to produce highly consistent, controllable, and visually realistic long videos, setting a new standard for scalable video world modeling.
Dataset
- The dataset for Stages 1 and 2 comprises approximately 60,000 videos drawn from three sources: ACID and ACID-Large for aerial drone footage of coastlines and natural landscapes; Vchitect_T2V_DataVerse, a corpus of over 14 million high-quality Internet videos with detailed textual annotations; and MovieNet, containing 1,100 full-length movies across genres, regions, and decades.
- All videos are converted into 81-frame clips sampled at 16 fps to ensure consistent temporal resolution and stable training.
- For Stage 3, which focuses on long-horizon modeling, the authors use long-form videos from OmniWorld and SpatialVID. From each video, an 81-frame target segment is extracted starting at the 20th frame, with all preceding frames used as history context. A total of 40,000 such segments are randomly selected to form the Stage 3 training split.
- To ensure temporal coherence, the authors apply PySceneDetect to detect and remove scene transitions, splitting raw videos into transition-free segments. Each segment is then uniformly sampled at 16 fps and truncated to 81 frames.
- For each 81-frame clip, the authors generate a rich set of control signals: depth maps via Video Depth Anything, point trajectories using SpatialTracker, and descriptive captions via Qwen-2.5-VL-7B.
- The final curated dataset consists of approximately 100,000 video-control signal pairs, forming the unified training foundation for LongVie.
- The evaluation dataset LongVGenBench is used to assess controllability and long-term coherence. It includes diverse real-world and synthetic scenes lasting at least one minute at 1080p resolution or higher, with varied camera motions.
- For inference, each LongVGenBench video is split into overlapping 81-frame clips with a one-frame overlap, and corresponding captions and control signals are extracted to construct the test input data (see the sketch after this list).
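As a concrete illustration of the clip layout above, the following minimal Python sketch splits a video into 81-frame clips whose boundaries share a single frame, so the last frame of each clip can seed the next autoregressive step. The function name and the one-minute / 16 fps example are illustrative assumptions, not the authors' code.

```python
from typing import List

CLIP_LEN = 81  # frames per clip, as used throughout the paper


def split_into_clips(num_frames: int, clip_len: int = CLIP_LEN) -> List[range]:
    """Return frame-index ranges such that consecutive clips share one frame.

    Clip i spans [i*(clip_len-1), i*(clip_len-1) + clip_len): its last frame
    is the first frame of clip i+1, which serves as the conditioning image
    for the next autoregressive step.
    """
    stride = clip_len - 1  # 80-frame stride -> one-frame overlap
    clips, start = [], 0
    while start + clip_len <= num_frames:
        clips.append(range(start, start + clip_len))
        start += stride
    return clips


# Example: a one-minute video at 16 fps has 960 frames -> 11 full 81-frame clips.
print(len(split_into_clips(16 * 60)))  # 11
```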
Method
The authors leverage a multi-modal control injection framework to enhance the controllability of long-form video generation. The core architecture, LongVie 2, is built upon a pre-trained DiT backbone, which is kept frozen to preserve its prior knowledge. To incorporate control signals, the model duplicates the initial 12 layers of the base DiT, creating two lightweight, trainable branches: a dense branch for processing depth maps and a sparse branch for handling point maps. These branches, denoted as F_D(⋅; θ_D) and F_P(⋅; θ_P), are designed to process their respective encoded control inputs, c_D and c_P. The control signals are injected into the main generation path through zero-initialized linear layers ϕ^l, which ensure that the control influence starts at zero and gradually increases during training, preventing disruption to the model's initial behavior. The overall computation in the l-th controlled DiT block is defined as

z^l = F^l(z^{l-1}) + ϕ^l(F_D^l(c_D^{l-1}) + F_P^l(c_P^{l-1})),

where F^l represents the frozen base DiT block. This design allows the model to leverage both the detailed structural information from depth maps and the high-level semantic cues from point trajectories to form an implicit world representation.
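To make the block structure concrete, here is a minimal PyTorch sketch of a controlled DiT block implementing the equation above: a frozen base block F^l plus trainable dense and sparse branches whose outputs are fused through a zero-initialized linear layer ϕ^l. Class and argument names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ControlledDiTBlock(nn.Module):
    """z^l = F^l(z^{l-1}) + phi^l(F_D^l(c_D^{l-1}) + F_P^l(c_P^{l-1}))."""

    def __init__(self, base_block: nn.Module, dense_block: nn.Module,
                 sparse_block: nn.Module, dim: int):
        super().__init__()
        self.base = base_block            # F^l: frozen pretrained DiT block
        self.dense = dense_block          # F_D^l: trainable copy for depth control
        self.sparse = sparse_block        # F_P^l: trainable copy for point control
        self.fuse = nn.Linear(dim, dim)   # phi^l: zero-initialized fusion layer
        nn.init.zeros_(self.fuse.weight)
        nn.init.zeros_(self.fuse.bias)
        for p in self.base.parameters():  # keep the backbone frozen
            p.requires_grad_(False)

    def forward(self, z, c_dense, c_sparse):
        c_dense = self.dense(c_dense)     # control features propagate layer by layer
        c_sparse = self.sparse(c_sparse)
        z = self.base(z) + self.fuse(c_dense + c_sparse)
        return z, c_dense, c_sparse


# Tiny usage example with linear stand-ins for the real DiT blocks.
dim = 64
block = ControlledDiTBlock(nn.Linear(dim, dim), nn.Linear(dim, dim),
                           nn.Linear(dim, dim), dim)
z, c_d, c_p = block(torch.randn(2, dim), torch.randn(2, dim), torch.randn(2, dim))
```

Because the fusion layer starts at zero, the block initially reproduces the frozen backbone exactly; the control pathway only gains influence as training updates ϕ^l.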

To address the inherent imbalance where dense control signals tend to dominate the generation process, the authors propose a degradation-based training strategy. This strategy weakens the influence of dense signals through two complementary mechanisms. At the feature level, with probability α, the latent representation of the dense control is randomly scaled by a factor λ sampled from a uniform distribution on [0.05, 1]. At the data level, with probability β, the dense control tensor undergoes degradation using two techniques: Random Scale Fusion, which creates a multi-scale, weighted sum of downsampled versions of the input, and Adaptive Blur Augmentation, which applies a randomly sized average blur to reduce sharpness. These degradations are designed to mitigate over-reliance on dense signals and encourage the model to learn a more balanced integration of both modalities. During this pretraining stage, only the parameters of the control branches and the fusion layers ϕ^l are updated, while the backbone remains frozen.
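The two degradation mechanisms can be summarized in a short sketch. The probabilities follow the values reported later in this section (α = 15%, β = 10%), while the scale set, kernel sizes, and tensor layout are assumptions made purely for illustration.

```python
import random
import torch
import torch.nn.functional as F


def degrade_dense_latent(c_dense: torch.Tensor, alpha: float = 0.15) -> torch.Tensor:
    """Feature-level degradation: with probability alpha, rescale the dense
    control latent by a factor lambda ~ U(0.05, 1)."""
    if random.random() < alpha:
        c_dense = c_dense * random.uniform(0.05, 1.0)
    return c_dense


def degrade_dense_map(depth: torch.Tensor, beta: float = 0.10) -> torch.Tensor:
    """Data-level degradation: with probability beta, apply Random Scale Fusion
    (a weighted sum of re-upsampled low-resolution copies) followed by
    Adaptive Blur Augmentation (average blur with a random kernel size).
    `depth` is assumed to be a (B, C, H, W) tensor."""
    if random.random() >= beta:
        return depth
    _, _, h, w = depth.shape
    # Random Scale Fusion: blend several coarser versions of the map.
    scales = [1.0, 0.5, 0.25]                     # assumed scale set
    weights = torch.rand(len(scales))
    weights = weights / weights.sum()
    fused = torch.zeros_like(depth)
    for wgt, s in zip(weights, scales):
        low = F.interpolate(depth, scale_factor=s, mode="bilinear", align_corners=False)
        fused = fused + wgt * F.interpolate(low, size=(h, w), mode="bilinear",
                                            align_corners=False)
    # Adaptive Blur Augmentation: a random odd kernel keeps the spatial size fixed.
    k = random.choice([3, 5, 7, 9])
    return F.avg_pool2d(fused, kernel_size=k, stride=1, padding=k // 2)
```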

The training process is structured as a three-stage autoregressive framework. The first stage, Clean Pretraining, focuses on establishing a strong foundation for controllability by training the model with clean, un-degraded inputs. The second stage, Degradation Tuning, introduces a degradation-aware strategy to bridge the domain gap between training and long-horizon inference. This stage intentionally degrades the first image of each clip to simulate the quality decay that occurs during long-term generation. The degradation operator T(⋅) is defined as a probabilistic combination of two mechanisms: Encoding degradation, which simulates VAE-induced corruption by repeatedly encoding and decoding the image, and Generation degradation, which simulates diffusion-based degradation by adding noise and denoising the latent representation. This stage improves visual quality but introduces a new challenge of temporal inconsistency. The final stage, History-Aware Refinement, addresses this by introducing history context guidance. During this stage, the model is conditioned on the latent representations z_H of the N_H preceding history frames, which are obtained by encoding the history frames. To align with the degraded inputs encountered during inference, the history frames are also degraded using the same operator T(⋅) before encoding. The model is trained to generate the next clip conditioned on the initial frame, the history context, and the control signals, with the goal of maintaining temporal coherence.
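A hedged sketch of the Stage 2 degradation operator T(⋅) is given below: it either runs the first frame through repeated VAE encode/decode round-trips (encoding degradation) or perturbs its latent with noise and denoises it (generation degradation). The `vae` and `denoiser` interfaces, the mixing probability, and the noise-level range are assumptions, since the summary does not specify them.

```python
import random
import torch


def degrade_first_frame(img: torch.Tensor, vae, denoiser,
                        p_encode: float = 0.5, max_round_trips: int = 3) -> torch.Tensor:
    """Probabilistic first-frame degradation operator T(.) (sketch).

    `vae.encode` / `vae.decode` and `denoiser(latent, noise_level)` are assumed
    interfaces; the mixing probability and noise-level range are illustrative.
    """
    if random.random() < p_encode:
        # Encoding degradation: repeated VAE round-trips accumulate the kind of
        # reconstruction artifacts seen at every autoregressive step.
        for _ in range(random.randint(1, max_round_trips)):
            img = vae.decode(vae.encode(img))
        return img
    # Generation degradation: perturb the latent with noise, then denoise it,
    # mimicking diffusion-induced drift in long-horizon rollouts.
    z = vae.encode(img)
    t = random.uniform(0.1, 0.4)                        # assumed noise-level range
    z_noisy = (1.0 - t) * z + t * torch.randn_like(z)   # simple linear noising (assumption)
    return vae.decode(denoiser(z_noisy, t))
```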

To further enhance temporal consistency, the authors introduce two training-free strategies. The first is Unified Noise Initialization, which maintains a single shared noise instance across all video clips, providing a coherent stochastic prior that strengthens temporal continuity. The second is Global Normalization, which ensures a consistent depth scale across clips by computing the 5th and 95th percentiles of all pixel values in the full video and using them to clip and linearly scale the depth values to the range [0,1]. This strategy is robust to outliers and prevents temporal discontinuities caused by independent normalization. The model configuration details include setting the feature-level degradation probability α to 15% and the data-level degradation probability β to 10%. The degradation strategies are gradually introduced during training, with all strategies disabled for the first 2000 iterations and gradually activated over the final 1000 iterations. The dense and sparse branches are initialized using a "half-copy" method, where pretrained weights are interleaved and the feature dimensionality is halved, providing a stable starting point for joint learning.
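The Global Normalization strategy described above reduces to a few lines: compute the 5th/95th percentiles once over the whole video, then reuse the same bounds to clip and rescale every clip. In this sketch the pixel subsampling step and the (T, H, W) tensor layout are assumptions made for tractability, not details from the paper.

```python
import torch


def global_depth_normalize(depth_video: torch.Tensor,
                           max_samples: int = 1_000_000) -> torch.Tensor:
    """Clip depth to the video-level 5th/95th percentiles and rescale to [0, 1].

    `depth_video` is assumed to be a (T, H, W) tensor; percentiles are estimated
    on a random pixel subsample so torch.quantile stays tractable on long videos.
    """
    flat = depth_video.flatten()
    if flat.numel() > max_samples:
        flat = flat[torch.randint(0, flat.numel(), (max_samples,))]
    lo = torch.quantile(flat, 0.05).item()
    hi = torch.quantile(flat, 0.95).item()
    return (depth_video.clamp(lo, hi) - lo) / (hi - lo + 1e-6)
```

Unified Noise Initialization is simpler still: sample one noise tensor before generation starts and reuse it as the initial latent noise for every clip, rather than drawing fresh noise per clip.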
Experiment
- LongVie 2 is evaluated on LongVGenBench, a new benchmark of 100 high-resolution one-minute videos covering diverse real-world and synthetic environments, validating its long-range controllability, temporal coherence, and visual fidelity.
- On LongVGenBench, LongVie 2 achieves state-of-the-art performance with 58.47% Aesthetic Quality, 69.77% Imaging Quality, 0.529 SSIM, 0.295 LPIPS, 91.05% Subject Consistency, 92.45% Background Consistency, 23.37% Overall Consistency, and 82.95% Dynamic Degree, surpassing all baselines including Wan2.1, Go-With-The-Flow, DAS, Hunyuan-GameCraft, and Matrix-Game.
- Human evaluation with 60 participants confirms LongVie 2 outperforms all baselines across Visual Quality, Prompt-Video Consistency, Condition Consistency, Color Consistency, and Temporal Consistency, achieving the highest average scores in all categories.
- Ablation studies demonstrate that each training stage—Control Learning, Degradation-aware Training, and History-context Guidance—progressively improves controllability, visual quality, and long-term consistency, with the full model achieving the best results.
- LongVie 2 successfully generates continuous videos up to five minutes in length, maintaining high visual fidelity, structural stability, motion coherence, and style consistency across diverse scenarios, including subject-driven and subject-free sequences with seasonal style transfers.
The authors use an ablation study to evaluate the impact of global normalization and unified initial noise on the performance of LongVie 2. Results show that removing either component leads to a noticeable drop in visual quality, controllability, and temporal consistency, indicating that both are essential for maintaining high-quality and consistent long video generation.

Results show that LongVie 2 achieves the highest scores across all human evaluation metrics, including Visual Quality, Prompt-Video Consistency, Condition Consistency, Color Consistency, and Temporal Consistency, demonstrating superior perceptual quality and controllability compared to baseline models.

Results show that LongVie 2 achieves state-of-the-art performance across all evaluated metrics, outperforming existing baselines in visual quality, controllability, and temporal consistency. The model demonstrates superior results in aesthetic quality, imaging quality, SSIM, LPIPS, and all temporal consistency measures, confirming its effectiveness in long-range controllable video generation.

The authors use a staged training approach to progressively enhance controllability, visual quality, and temporal consistency in LongVie 2. Results show that each stage contributes incrementally, with the final model achieving state-of-the-art performance across all metrics, particularly in long-term temporal coherence and visual fidelity.

Results show that the degradation training strategy in LongVie 2 significantly improves video quality, controllability, and temporal consistency. Adding both encoding and generation degradation leads to the best performance, with the combined approach achieving the highest scores across all metrics, demonstrating the complementary benefits of these two degradation types.
