LongVie 2: A Multi-modal Controllable Ultra-Long Video World Model
Jianxiong Gao, Zhaoxi Chen, Xian Liu, Junhao Zhuang, Chengming Xu, Jianfeng Feng, Yu Qiao, Yanwei Fu, Chenyang Si, Ziwei Liu
Abstract
Building a video world model on top of pretrained video generation systems is an important yet challenging step toward general spatiotemporal intelligence. Such a world model must exhibit three essential properties: controllability, long-term visual quality, and temporal consistency. This work adopts a progressive approach, first strengthening controllability and then extending to long-term, high-quality generation. To this end, the authors propose LongVie 2, an end-to-end autoregressive framework trained with a three-stage process. The first stage, multi-modal guidance, integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability. The second stage, degradation-aware training on the input frames, bridges the gap between training and long-horizon inference to preserve visual quality. The third stage, history-context guidance, aligns contextual information across adjacent clips to ensure temporal consistency. In addition, the authors introduce LongVGenBench, a comprehensive benchmark of 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments show that LongVie 2 achieves state-of-the-art long-term controllability, temporal consistency, and visual fidelity, and supports continuous video generation of up to five minutes, marking an important step toward a unified video world model.
One-sentence Summary
The authors from FDU, NJU, NTU, NVIDIA, THU, and Shanghai AI Laboratory propose LongVie 2, an end-to-end autoregressive video world model that achieves controllable, ultra-long video generation up to five minutes by integrating multi-modal guidance, degradation-aware training, and history-context modeling—enabling superior long-term consistency and visual fidelity compared to prior work.
Key Contributions
- LongVie 2 addresses the challenge of building controllable, long-horizon video world models by extending pretrained diffusion backbones with a progressive three-stage training framework that enhances controllability, temporal consistency, and visual fidelity—key properties for realistic spatiotemporal modeling.
- The method introduces multi-modal guidance using both dense (e.g., depth maps) and sparse (e.g., keypoints) control signals to provide implicit world-level supervision, degradation-aware training to simulate long-term inference conditions and maintain visual quality, and history-context guidance to align adjacent clips and ensure long-range temporal coherence.
- LongVie 2 is evaluated on LongVGenBench, a new benchmark of 100 high-resolution one-minute videos spanning diverse real-world and synthetic environments, where it achieves state-of-the-art performance in controllability, temporal consistency, and visual fidelity, supporting continuous video generation up to five minutes.
Introduction
The authors leverage pretrained video diffusion models to build LongVie 2, a controllable ultra-long video world model capable of generating 3–5 minute videos with high visual fidelity and temporal consistency. This work addresses two key challenges in prior video world models: limited controllability—often restricted to low-level or localized inputs—and fragile long-term coherence, where visual quality degrades and temporal drift emerges over extended sequences. To overcome these, the authors introduce a three-stage training framework: first, they integrate multi-modal guidance using both dense (e.g., depth maps) and sparse (e.g., keypoints) control signals to enable global, semantic-level scene manipulation; second, they employ a degradation-aware training strategy that applies controlled distortions to early frames, bridging the gap between short training clips and long-horizon inference; third, they incorporate history context by feeding preceding frames as input, ensuring stable temporal evolution across generated segments. These innovations enable LongVie 2 to produce highly consistent, controllable, and visually realistic long videos, setting a new standard for scalable video world modeling.
Dataset
- The dataset for Stages 1 and 2 comprises approximately 60,000 videos drawn from three sources: ACID and ACID-Large for aerial drone footage of coastlines and natural landscapes; Vchitect_T2V_DataVerse, a corpus of over 14 million high-quality Internet videos with detailed textual annotations; and MovieNet, containing 1,100 full-length movies across genres, regions, and decades.
- All videos are converted into 81-frame clips sampled at 16 fps to ensure consistent temporal resolution and stable training.
- For Stage 3, which focuses on long-horizon modeling, the authors use long-form videos from OmniWorld and SpatialVID. From each video, an 81-frame target segment is extracted starting at the 20th frame, with all preceding frames used as history context. A total of 40,000 such segments are randomly selected to form the Stage 3 training split.
- To ensure temporal coherence, the authors apply PySceneDetect to detect and remove scene transitions, splitting raw videos into transition-free segments. Each segment is then uniformly sampled at 16 fps and truncated to 81 frames.
- For each 81-frame clip, the authors generate a rich set of control signals: depth maps via Video Depth Anything, point trajectories using SpatialTracker, and descriptive captions via Qwen-2.5-VL-7B.
- The final curated dataset consists of approximately 100,000 video-control signal pairs, forming the unified training foundation for LongVie 2.
- The evaluation dataset LongVGenBench is used to assess controllability and long-term coherence. It includes diverse real-world and synthetic scenes lasting at least one minute at 1080p resolution or higher, with varied camera motions.
- For inference, each LongVGenBench video is split into overlapping 81-frame clips with one-frame overlap, and corresponding captions and control signals are extracted to construct the test input data; a minimal clip-splitting sketch follows this list.
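To make the clip layout concrete, here is a minimal Python sketch that splits a frame array into 81-frame windows sharing one boundary frame; the function name, array layout, and the tiny placeholder video are illustrative assumptions rather than the authors' tooling.

```python
import numpy as np

def split_into_clips(frames: np.ndarray, clip_len: int = 81, overlap: int = 1):
    """Split a (T, H, W, C) frame array into consecutive clips of `clip_len`
    frames, where adjacent clips share `overlap` frames (the last frame of
    one clip is reused as the first frame of the next)."""
    clips = []
    stride = clip_len - overlap  # 80 new frames per clip when overlap == 1
    start = 0
    while start + clip_len <= len(frames):
        clips.append(frames[start:start + clip_len])
        start += stride
    return clips

# Tiny placeholder standing in for a decoded video: 961 frames at 16 fps.
video = np.zeros((961, 4, 4, 3), dtype=np.uint8)
print(len(split_into_clips(video)))  # -> 12 clips of 81 frames each
```

With one-frame overlap, every clip after the first contributes 80 new frames, which is how a roughly one-minute video at 16 fps decomposes into about a dozen clips.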
Method
The authors leverage a multi-modal control injection framework to enhance the controllability of long-form video generation. The core architecture, LongVie 2, is built upon a pre-trained DiT backbone, which is kept frozen to preserve its prior knowledge. To incorporate control signals, the model duplicates the initial 12 layers of the base DiT, creating two lightweight, trainable branches: a dense branch for processing depth maps and a sparse branch for handling point maps. These branches, denoted as $\mathcal{F}_D(\cdot;\theta_D)$ and $\mathcal{F}_P(\cdot;\theta_P)$, are designed to process their respective encoded control inputs, $c_D$ and $c_P$. The control signals are injected into the main generation path through zero-initialized linear layers $\phi^l$, which ensure that the control influence starts at zero and gradually increases during training, preventing disruption to the model's initial behavior. The overall computation in the $l$-th controlled DiT block is defined as

$$z^l = \mathcal{F}^l(z^{l-1}) + \phi^l\big(\mathcal{F}_D^l(c_D^{l-1}) + \mathcal{F}_P^l(c_P^{l-1})\big),$$

where $\mathcal{F}^l$ represents the frozen base DiT block. This design allows the model to leverage both the detailed structural information from depth maps and the high-level semantic cues from point trajectories to form an implicit world representation.
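The controlled block above can be read as a ControlNet-style residual injection. The PyTorch sketch below mirrors the equation under simplifying assumptions: the block modules are treated as single-tensor callables (a real DiT block also takes timestep and text conditioning), and the linear fusion layer $\phi^l$ is zero-initialized as described; none of this is the released implementation.

```python
import torch
import torch.nn as nn

class ControlledDiTBlock(nn.Module):
    """One controlled block: z^l = F^l(z^{l-1}) + phi^l(F_D^l(c_D) + F_P^l(c_P)).
    `base_block`, `dense_block`, and `sparse_block` stand in for F^l, F_D^l, F_P^l."""

    def __init__(self, base_block: nn.Module, dense_block: nn.Module,
                 sparse_block: nn.Module, dim: int):
        super().__init__()
        self.base_block = base_block      # frozen F^l from the pretrained DiT
        self.dense_block = dense_block    # trainable F_D^l (depth branch)
        self.sparse_block = sparse_block  # trainable F_P^l (point branch)
        self.fusion = nn.Linear(dim, dim) # phi^l, zero-initialized
        nn.init.zeros_(self.fusion.weight)
        nn.init.zeros_(self.fusion.bias)
        for p in self.base_block.parameters():  # keep the backbone frozen
            p.requires_grad_(False)

    def forward(self, z, c_dense, c_sparse):
        control = self.dense_block(c_dense) + self.sparse_block(c_sparse)
        return self.base_block(z) + self.fusion(control)
```

Because $\phi^l$ starts at zero, the block initially reproduces the frozen backbone's output, and the control pathway is blended in only as training updates the fusion weights.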

To address the inherent imbalance where dense control signals tend to dominate the generation process, the authors propose a degradation-based training strategy. This strategy weakens the influence of dense signals through two complementary mechanisms. At the feature level, with a probability α, the latent representation of the dense control is randomly scaled by a factor λ sampled from the uniform distribution [0.05, 1]. At the data level, with probability β, the dense control tensor undergoes degradation using two techniques: Random Scale Fusion, which creates a multi-scale, weighted sum of downsampled versions of the input, and Adaptive Blur Augmentation, which applies a randomly sized average blur to reduce sharpness. These degradations are designed to mitigate over-reliance on dense signals and encourage the model to learn a more balanced integration of both modalities. During this pretraining stage, only the parameters of the control branches and the fusion layers $\phi^l$ are updated, while the backbone remains frozen.
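A sketch of both degradation levels is given below. The probability default matches the value reported later (α = 15%), and the data-level functions would be applied with probability β; the set of fusion scales, the blur kernel range, and the float (C, H, W) tensor layout are illustrative assumptions.

```python
import random
import torch
import torch.nn.functional as F

def degrade_dense_feature(latent: torch.Tensor, alpha: float = 0.15) -> torch.Tensor:
    """Feature level: with probability alpha, scale the dense-control latent
    by a factor lambda drawn uniformly from [0.05, 1]."""
    if random.random() < alpha:
        latent = latent * random.uniform(0.05, 1.0)
    return latent

def random_scale_fusion(ctrl: torch.Tensor, scales=(1.0, 0.5, 0.25)) -> torch.Tensor:
    """Data level (sketch): randomly weighted sum of down/up-sampled copies
    of the dense control tensor, which softens fine structure."""
    h, w = ctrl.shape[-2:]
    weights = torch.rand(len(scales))
    weights = weights / weights.sum()
    out = torch.zeros_like(ctrl)
    for wgt, s in zip(weights, scales):
        low = F.interpolate(ctrl[None], scale_factor=s, mode="bilinear",
                            align_corners=False)
        out += wgt * F.interpolate(low, size=(h, w), mode="bilinear",
                                   align_corners=False)[0]
    return out

def adaptive_blur(ctrl: torch.Tensor, max_kernel: int = 9) -> torch.Tensor:
    """Data level (sketch): average blur with a randomly chosen odd kernel size,
    applied channel-wise to reduce sharpness."""
    k = random.choice(range(3, max_kernel + 1, 2))
    kernel = torch.ones(ctrl.shape[0], 1, k, k) / (k * k)
    return F.conv2d(ctrl[None], kernel, padding=k // 2, groups=ctrl.shape[0])[0]
```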

The training process is structured as a three-stage autoregressive framework. The first stage, Clean Pretraining, focuses on establishing a strong foundation for controllability by training the model with clean, un-degraded inputs. The second stage, Degradation Tuning, introduces a degradation-aware strategy to bridge the domain gap between training and long-horizon inference. This stage intentionally degrades the first image of each clip to simulate the quality decay that occurs during long-term generation. The degradation operator $T(\cdot)$ is defined as a probabilistic combination of two mechanisms: Encoding degradation, which simulates VAE-induced corruption by repeatedly encoding and decoding the image, and Generation degradation, which simulates diffusion-based degradation by adding noise and denoising the latent representation. This stage improves visual quality but introduces a new challenge of temporal inconsistency. The final stage, History-Aware Refinement, addresses this by introducing history context guidance. During this stage, the model is conditioned on the latent representations of the $N_H$ preceding history frames, $z_H$, which are obtained by encoding the history frames. To align with the degraded inputs encountered during inference, the history frames are also degraded using the same operator $T(\cdot)$ before encoding. The model is trained to generate the next clip conditioned on the initial frame, the history context, and the control signals, with the goal of maintaining temporal coherence.
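A rough sketch of the degradation operator $T(\cdot)$ is shown below; `vae` and `diffusion` are stand-ins for the frozen VAE and the denoiser, and their `encode`/`decode`/`add_noise`/`denoise` interfaces, the cycle count, and the noise-step range are assumptions made for illustration, not the authors' API.

```python
import random
import torch

def degrade_clip_frame(image: torch.Tensor, vae, diffusion, p_encode: float = 0.5,
                       max_cycles: int = 3, max_t: int = 300) -> torch.Tensor:
    """Sketch of T(.) applied to a clip's first frame (and, in Stage 3, to the
    history frames before encoding). `vae` and `diffusion` are hypothetical
    placeholder objects."""
    if random.random() < p_encode:
        # Encoding degradation: repeated VAE round trips accumulate
        # reconstruction error, mimicking codec-induced quality loss.
        for _ in range(random.randint(1, max_cycles)):
            image = vae.decode(vae.encode(image))
    else:
        # Generation degradation: partially noise the latent, then denoise it,
        # mimicking the drift introduced by autoregressive diffusion sampling.
        t = random.randint(1, max_t)
        latent = vae.encode(image)
        noisy = diffusion.add_noise(latent, torch.randn_like(latent), t)
        image = vae.decode(diffusion.denoise(noisy, t))
    return image
```

In Stage 3 the same operator would be applied to the $N_H$ history frames before they are encoded into $z_H$, so the conditioning seen at training time resembles the degraded context the model actually produces during long-horizon inference.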

To further enhance temporal consistency, the authors introduce two training-free strategies. The first is Unified Noise Initialization, which maintains a single shared noise instance across all video clips, providing a coherent stochastic prior that strengthens temporal continuity. The second is Global Normalization, which ensures a consistent depth scale across clips by computing the 5th and 95th percentiles of all pixel values in the full video and using them to clip and linearly scale the depth values to the range [0,1]. This strategy is robust to outliers and prevents temporal discontinuities caused by independent normalization. The model configuration details include setting the feature-level degradation probability α to 15% and the data-level degradation probability β to 10%. The degradation strategies are gradually introduced during training, with all strategies disabled for the first 2000 iterations and gradually activated over the final 1000 iterations. The dense and sparse branches are initialized using a "half-copy" method, where pretrained weights are interleaved and the feature dimensionality is halved, providing a stable starting point for joint learning.
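Global Normalization is simple enough to state directly; the sketch below assumes the depth maps of the entire video are stacked into one (T, H, W) array. Unified Noise Initialization is not shown, but amounts to sampling a single latent noise tensor once per video and reusing it as the initial noise for every clip.

```python
import numpy as np

def global_normalize_depth(depth_video: np.ndarray, lo_pct: float = 5.0,
                           hi_pct: float = 95.0) -> np.ndarray:
    """Compute the 5th/95th percentiles over the full (T, H, W) depth video,
    clip to that range, and rescale to [0, 1] so every clip shares one scale."""
    lo, hi = np.percentile(depth_video, [lo_pct, hi_pct])
    clipped = np.clip(depth_video, lo, hi)
    return (clipped - lo) / max(hi - lo, 1e-6)
```

Normalizing with video-level percentiles, rather than per clip, keeps the depth scale consistent across clip boundaries and is robust to occasional extreme depth values.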
Experiment
- LongVie 2 is evaluated on LongVGenBench, a new benchmark of 100 high-resolution one-minute videos covering diverse real-world and synthetic environments, validating its long-range controllability, temporal coherence, and visual fidelity.
- On LongVGenBench, LongVie 2 achieves state-of-the-art performance with 58.47% Aesthetic Quality, 69.77% Imaging Quality, 0.529 SSIM, 0.295 LPIPS, 91.05% Subject Consistency, 92.45% Background Consistency, 23.37% Overall Consistency, and 82.95% Dynamic Degree, surpassing all baselines including Wan2.1, Go-With-The-Flow, DAS, Hunyuan-GameCraft, and Matrix-Game.
- Human evaluation with 60 participants confirms LongVie 2 outperforms all baselines across Visual Quality, Prompt-Video Consistency, Condition Consistency, Color Consistency, and Temporal Consistency, achieving the highest average scores in all categories.
- Ablation studies demonstrate that each training stage—Control Learning, Degradation-aware Training, and History-context Guidance—progressively improves controllability, visual quality, and long-term consistency, with the full model achieving the best results.
- LongVie 2 successfully generates continuous videos up to five minutes in length, maintaining high visual fidelity, structural stability, motion coherence, and style consistency across diverse scenarios, including subject-driven and subject-free sequences with seasonal style transfers.
The authors use an ablation study to evaluate the impact of global normalization and unified initial noise on the performance of LongVie 2. Results show that removing either component leads to a noticeable drop in visual quality, controllability, and temporal consistency, indicating that both are essential for maintaining high-quality and consistent long video generation.

Results show that LongVie 2 achieves the highest scores across all human evaluation metrics, including Visual Quality, Prompt-Video Consistency, Condition Consistency, Color Consistency, and Temporal Consistency, demonstrating superior perceptual quality and controllability compared to baseline models.

Results show that LongVie 2 achieves state-of-the-art performance across all evaluated metrics, outperforming existing baselines in visual quality, controllability, and temporal consistency. The model demonstrates superior results in aesthetic quality, imaging quality, SSIM, LPIPS, and all temporal consistency measures, confirming its effectiveness in long-range controllable video generation.

The authors use a staged training approach to progressively enhance controllability, visual quality, and temporal consistency in LongVie 2. Results show that each stage contributes incrementally, with the final model achieving state-of-the-art performance across all metrics, particularly in long-term temporal coherence and visual fidelity.

Results show that the degradation training strategy in LongVie 2 significantly improves video quality, controllability, and temporal consistency. Adding both encoding and generation degradation leads to the best performance, with the combined approach achieving the highest scores across all metrics, demonstrating the complementary benefits of these two degradation types.
