LongVie 2: Multimodal Controllable Ultra-Long World Model

Jianxiong Gao, Zhaoxi Chen, Xian Liu, Junhao Zhuang, Chengming Xu, Jianfeng Feng, Yu Qiao, Yanwei Fu, Chenyang Si, Ziwei Liu

Abstract

Building video world models from pretrained video generation systems is an important yet challenging step toward general spatio-temporal intelligence. A world model must possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach: first improving controllability, then extending toward long-horizon, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) degradation-aware training on input frames, which narrows the gap between training and long-horizon inference to preserve high visual quality; and (3) history context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We also introduce LongVGenBench, a comprehensive benchmark of 100 high-resolution one-minute videos covering diverse real and synthetic environments. Extensive experiments show that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal consistency, and visual fidelity, while supporting continuous video generation of up to five minutes, marking a significant step toward unified video world modeling.

One-sentence Summary

Fudan University, Nanyang Technological University, and Shanghai AI Laboratory researchers propose LongVie 2, an autoregressive video world model generating controllable 3–5 minute videos. It introduces multi-modal guidance for dense/sparse control, a degradation-aware training strategy to bridge training-inference gaps, and history-context modeling for temporal consistency, significantly advancing long-range video generation fidelity and controllability over prior approaches.

Key Contributions

  • LongVie 2 addresses the critical limitations in current video world models, which suffer from restricted semantic-level controllability and temporal degradation when generating videos beyond one minute. It introduces a progressive framework to unify fine-grained control with long-horizon stability for scalable world modeling.
  • The method employs a three-stage training approach: integrating dense and sparse control signals for enhanced controllability, applying degradation-aware training to maintain visual quality during long inference, and using history-context guidance to ensure temporal consistency across extended sequences. This end-to-end autoregressive framework systematically bridges short-clip generation to minute-long coherent outputs.
  • Evaluated on LongVGenBench—a rigorous benchmark of 100 diverse one-minute high-resolution videos—LongVie 2 achieves state-of-the-art results in controllability, temporal coherence, and visual fidelity while supporting continuous generation up to five minutes, demonstrating significant advancement toward unified video world models.

Introduction

Recent video diffusion models like Sora and Kling have enabled photorealistic text-to-video generation, but research now prioritizes video world models that simulate controllable physical environments for applications like virtual training and interactive media. However, existing world models suffer from limited semantic-level controllability—they cannot manipulate entire scenes coherently—and fail to maintain visual quality or temporal consistency beyond one-minute durations due to drift and degradation. The authors address this by extending pretrained diffusion backbones into LongVie 2, a framework trained through three progressive stages: multi-modal guidance for structural control, degradation-aware training to bridge short-clip and long-horizon inference gaps, and history context guidance for long-range coherence. This approach achieves minute-long controllable video generation while introducing LongVGenBench, a benchmark of 100 one-minute videos for rigorous evaluation of long-horizon fidelity.

Dataset

The authors use a multi-stage training approach with distinct datasets and processing pipelines:

  • Composition and sources:
    Stages 1–2 train on ~60,000 videos from three sources: ACID/ACID-Large (drone footage of coastlines/landscapes), Vchitect_T2V_DataVerse (14M+ internet videos with text annotations), and MovieNet (1,100 full-length movies). Stage 3 uses long-form videos from OmniWorld and SpatialVID for temporal modeling. The evaluation benchmark LongVGenBench contains diverse 1+ minute, 1080p+ videos.

  • Subset details:

    • Stages 1–2 data: Unified into 81-frame clips at 16 fps. ACID ensures RealEstate10K-compatible metadata; MovieNet provides complex scenes.
    • Stage 3 data: Processes long videos into 81-frame target segments starting at frame 20, using all preceding frames as history context. The training split comprises 40,000 randomly selected segments.
    • LongVGenBench: Split into 81-frame clips with one-frame overlap for evaluation, each paired with captions and control signals.
  • Data usage:
    Stages 1–2 train on the full 60,000-video corpus. Stage 3 exclusively uses the 40,000-segment split with history context. For LongVGenBench evaluation, short-clip captions and control signals guide inference.

  • Processing details:
    All training videos undergo strict pre-processing: scene transitions are removed via PySceneDetect, yielding transition-free clips. Each clip is sampled at 16 fps, truncated to 81 frames, and augmented with depth maps (Video Depth Anything), point trajectories (SpatialTracker), and captions (Qwen-2.5-VL-7B). This creates a final curated set of ~100,000 video-control pairs for training.
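
The pre-processing pipeline described above can be sketched roughly as follows. This is a hedged reconstruction, not the authors' code: the scene splitting uses PySceneDetect's public `detect` API, while `read_frames`, `estimate_depth`, `track_points`, and `caption_clip` are hypothetical placeholders standing in for video decoding, Video Depth Anything, SpatialTracker, and Qwen-2.5-VL-7B respectively.

```python
from scenedetect import detect, ContentDetector  # real PySceneDetect API (v0.6+)

CLIP_LEN = 81     # frames per training clip
TARGET_FPS = 16   # sampling rate used for all clips

def build_training_pairs(video_path, read_frames, estimate_depth,
                         track_points, caption_clip):
    """Turn one raw video into (frames, depth, tracks, caption) training pairs.

    read_frames, estimate_depth, track_points, and caption_clip are injected
    placeholders for the decoding and annotation models named in the text.
    """
    pairs = []
    # 1. Split at scene transitions so every clip is transition-free.
    for start, end in detect(video_path, ContentDetector()):
        frames = read_frames(video_path, start, end, fps=TARGET_FPS)
        # 2. Cut into 81-frame clips sampled at 16 fps.
        for i in range(0, len(frames) - CLIP_LEN + 1, CLIP_LEN):
            clip = frames[i:i + CLIP_LEN]
            pairs.append({
                "frames": clip,
                "depth": estimate_depth(clip),   # dense control (Video Depth Anything)
                "tracks": track_points(clip),    # sparse control (SpatialTracker)
                "caption": caption_clip(clip),   # text annotation (Qwen-2.5-VL-7B)
            })
    return pairs
```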

Method

The authors leverage a three-stage training framework to build LongVie 2, an autoregressive video world model capable of generating controllable, temporally consistent videos up to 3–5 minutes in duration. The architecture integrates multi-modal control signals, degradation-aware training, and history-context modeling to bridge the gap between short-clip training and long-horizon inference.

The overall framework, as shown in the figure below, begins with an input image and corresponding dense (depth) and sparse (point trajectory) control signals that provide world-level guidance. These modalities are processed through a modified DiT backbone that injects control features additively into the generation stream via zero-initialized linear layers, preserving the stability of the pre-trained base model while enabling fine-grained conditioning.

In Stage I, the model is initialized with clean pretraining using standard ControlNet-style conditioning. The authors construct a Multi-Modal Control DiT by duplicating the first 12 layers of the pre-trained Wan DiT and splitting each into two trainable branches, one for dense control ($\mathcal{F}_{\mathrm{D}}$) and one for sparse control ($\mathcal{F}_{\mathrm{P}}$). These branches process their respective encoded control signals $c_{\mathrm{D}}$ and $c_{\mathrm{P}}$, and their outputs are fused into the frozen base DiT stream via zero-initialized linear layers $\phi^{l}$, ensuring no initial interference with the pretrained weights. The computation at layer $l$ is defined as:

$$
z^{l} = \mathcal{F}^{l}(z^{l-1}) + \phi^{l}\big( \mathcal{F}_{\mathrm{D}}^{l}(c_{\mathrm{D}}^{l-1}) + \mathcal{F}_{\mathrm{P}}^{l}(c_{\mathrm{P}}^{l-1}) \big),
$$

where $\mathcal{F}^{l}$ denotes the frozen base block. To prevent dense signals from dominating, the authors introduce feature-level and data-level degradation during training. Feature-level degradation scales the dense latent representation by a random factor $\lambda \in [0.05, 1]$ with probability $\alpha$, reformulating the above equation as:

$$
z^{l} = \mathcal{F}^{l}(z^{l-1}) + \phi^{l}\big( \lambda \cdot \mathcal{F}_{\mathrm{D}}^{l}(c_{\mathrm{D}}^{l-1}) + \mathcal{F}_{\mathrm{P}}^{l}(c_{\mathrm{P}}^{l-1}) \big)
$$
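
To make the injection mechanism concrete, below is a minimal PyTorch-style sketch of one control-injected layer. It is an illustration under simplifying assumptions, not the authors' implementation: the `base_block`, `dense_branch`, and `sparse_branch` modules and their tensor shapes are hypothetical stand-ins, and `degrade_prob` plays the role of the probability $\alpha$. The sketch only shows the zero-initialized additive fusion and the random feature-level scaling $\lambda$.

```python
import random
import torch.nn as nn

class ControlInjectedLayer(nn.Module):
    """One layer of a hypothetical Multi-Modal Control DiT.

    base_block    : frozen pretrained DiT block F^l
    dense_branch  : trainable duplicated block for dense (depth) control F_D^l
    sparse_branch : trainable duplicated block for sparse (point-track) control F_P^l
    fuse          : zero-initialized linear layer phi^l
    """
    def __init__(self, base_block, dense_branch, sparse_branch, dim,
                 degrade_prob=0.5):
        super().__init__()
        self.base_block = base_block
        for p in self.base_block.parameters():   # keep the base model frozen
            p.requires_grad_(False)
        self.dense_branch = dense_branch
        self.sparse_branch = sparse_branch
        self.fuse = nn.Linear(dim, dim)
        nn.init.zeros_(self.fuse.weight)         # zero-init: no interference at start
        nn.init.zeros_(self.fuse.bias)
        self.degrade_prob = degrade_prob         # probability alpha in the text

    def forward(self, z, c_dense, c_sparse):
        # Feature-level degradation: with probability alpha, scale the dense
        # branch by a random lambda in [0.05, 1] so it cannot dominate.
        lam = 1.0
        if self.training and random.random() < self.degrade_prob:
            lam = random.uniform(0.05, 1.0)
        ctrl = lam * self.dense_branch(c_dense) + self.sparse_branch(c_sparse)
        return self.base_block(z) + self.fuse(ctrl)
```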

Data-level degradation applies Random Scale Fusion and Adaptive Blur Augmentation to the dense input tensor, enhancing robustness to spatial variation and reducing overfitting to local depth details.

In Stage II, the authors address the domain gap between clean training inputs and degraded inference inputs by introducing a first-frame degradation strategy. As shown in the figure below, two degradation mechanisms are applied: encoding degradation, which simulates VAE-induced corruption via $K$ repeated encode-decode cycles, and generation degradation, which adds Gaussian noise to the latent representation at a random timestep $t < 15$ and then denoises it. The degradation operator $\mathcal{T}(I)$ is defined as:

$$
\mathcal{T}(I) = \begin{cases} (\mathcal{D} \circ \mathcal{E})^{K}(I) & \text{w.p. } 0.2 \\ \mathcal{D}\big( \Phi_0\big( \sqrt{\alpha_t}\, \mathcal{E}(I) + \sqrt{1-\alpha_t}\, \epsilon \big) \big) & \text{w.p. } 0.8 \end{cases}
$$

where $\epsilon \sim \mathcal{N}(0, \mathbf{I})$. This degradation is applied with probability $\alpha$ during training, with milder degradations occurring more frequently to simulate the gradual quality decay observed in long-horizon generation.

![](https://api-rsrc.hyper.ai/paper2blog/2512.13604/tex_resource/monkeyocr/images/462a90ba2ff7866d51b3d875cfa4a5f4bc67a6f706aa935a6a52e7564170ddd4.jpg)

In Stage III, the model is refined with history-context guidance to enforce temporal consistency across clips. The authors encode the last $N_H$ frames of the preceding clip into latent space using the VAE encoder $\mathcal{E}(\cdot)$, apply the same degradation operator $\mathcal{T}(\cdot)$ to these frames, and then encode the degraded versions to obtain $\tilde{z}_H$. The model is trained to generate the next clip conditioned on the initial-frame latent $z_I$, the history latent $\tilde{z}_H$, and the control signals $c_D$ and $c_P$, as formulated by:

$$
z_t = \mathcal{F}(z_{t+1} \mid z_I, z_H, c_D, c_P)
$$
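
As a concrete illustration of how the degradation operator $\mathcal{T}$ and the history conditioning fit together, here is a minimal PyTorch-style sketch. It is only a schematic under assumed interfaces: `vae.encode`/`vae.decode`, the single-step denoiser `denoise_step` standing in for $\Phi_0$, and the `model(...)` call signature are hypothetical placeholders, not the released API.

```python
import math
import random
import torch

def degrade(frame, vae, denoise_step, alphas_cumprod, max_cycles=3):
    """Sketch of the degradation operator T(I) from Stage II.

    With prob. 0.2: K repeated VAE encode-decode cycles (encoding degradation).
    With prob. 0.8: add Gaussian noise to the latent at a random timestep t < 15,
    run one denoising step, and decode (generation degradation).
    """
    if random.random() < 0.2:
        out = frame
        for _ in range(random.randint(1, max_cycles)):   # K cycles (K assumed small)
            out = vae.decode(vae.encode(out))
        return out
    t = random.randint(0, 14)                  # random timestep t < 15
    a_t = float(alphas_cumprod[t])             # cumulative noise-schedule value
    z = vae.encode(frame)
    noisy = math.sqrt(a_t) * z + math.sqrt(1.0 - a_t) * torch.randn_like(z)
    return vae.decode(denoise_step(noisy, t))  # denoise_step stands in for Phi_0


def history_conditioned_step(model, vae, denoise_step, alphas_cumprod,
                             z_noisy, first_frame, history_frames,
                             c_dense, c_sparse):
    """Stage III sketch: one denoising step conditioned on a degraded history.

    history_frames : last N_H frames of the previously generated clip.
    """
    z_I = vae.encode(first_frame)
    z_H = torch.stack([vae.encode(degrade(f, vae, denoise_step, alphas_cumprod))
                       for f in history_frames])
    # One autoregressive denoising step: z_t = F(z_{t+1} | z_I, z_H, c_D, c_P)
    return model(z_noisy, z_I=z_I, z_H=z_H, c_D=c_dense, c_P=c_sparse)
```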

To stabilize the boundary between clips, the authors assign exponentially increasing weights to the first three generated frames and introduce three regularization losses: history context consistency $\mathcal{L}_{\mathrm{cons}} = \| z_{H}^{-1} - \hat{z}^{0} \|^{2}$, degradation consistency $\mathcal{L}_{\mathrm{deg}} = \| \mathcal{F}_{\mathrm{lp}}(\tilde{z}_{I}^{0}) - \mathcal{F}_{\mathrm{lp}}(\hat{z}^{0}) \|^{2}$, and ground-truth high-frequency alignment $\mathcal{L}_{\mathrm{gt}} = \| \mathcal{F}_{\mathrm{hp}}(z_{\mathrm{gt}}^{0}) - \mathcal{F}_{\mathrm{hp}}(\hat{z}^{0}) \|^{2}$. The final temporal regularization objective is:

$$
\mathcal{L}_{\mathrm{temp}} = \lambda_{\mathrm{deg}} \mathcal{L}_{\mathrm{deg}} + \lambda_{\mathrm{gt}} \mathcal{L}_{\mathrm{gt}} + \lambda_{\mathrm{cons}} \mathcal{L}_{\mathrm{cons}}
$$

with $\lambda_{\mathrm{deg}}=0.2$, $\lambda_{\mathrm{gt}}=0.15$, and $\lambda_{\mathrm{cons}}=0.5$. Additionally, the self-attention layers of the base model are updated to capture causal dependencies, and $N_H$ is sampled uniformly from $[0, 16]$ to support flexible inference. The training pipeline, as illustrated in the figure below, progresses from clean pretraining to degradation tuning and finally to history-aware refinement, each stage building upon the previous to enhance controllability, visual fidelity, and temporal coherence.

![](https://api-rsrc.hyper.ai/paper2blog/2512.13604/tex_resource/monkeyocr/images/ca242194fe25eeaae7eb17c336272500a19e027c38ab1ea4e3dd79b1f654efaf.jpg)

During inference, the authors employ two training-free strategies to further improve inter-clip consistency: unified noise initialization, which maintains a single shared noise latent across all clips, and global normalization of depth maps, which computes global 5th and 95th percentiles across the entire video to ensure consistent depth scaling (a minimal sketch of this normalization is given after the experiment tables below). Point tracks are recomputed per clip using globally normalized depth to preserve motion-guidance stability. Captions are refined using Qwen-2.5-VL-7B to align with the visual content of the generated frames, ensuring semantic consistency throughout the sequence.

Experiment

  • LongVie 2, validated on LongVGenBench (100 high-resolution videos), achieved state-of-the-art performance in controllability (superior SSIM/LPIPS scores), temporal coherence, and visual fidelity across all VBench metrics, surpassing pretrained models (Wan2.1), controllable models (VideoComposer, Go-with-the-Flow), and world models (Hunyuan-GameCraft).
  • Human evaluation with 60 participants confirmed LongVie 2 consistently outperformed baselines across all dimensions (Visual Quality, Prompt Consistency, Condition Consistency, Color Consistency, Temporal Consistency).
  • Extended generation tests demonstrated coherent 5-minute video synthesis while maintaining structural stability, motion consistency, and style adaptation in diverse real-world and synthetic scenarios.
  • Ablation studies proved the necessity of all three training stages: Control Learning enhanced controllability, Degradation-aware training improved visual quality, and History-context guidance ensured long-term temporal consistency.

The authors evaluate ablations of LongVie 2 by removing key components such as global normalization, unified initial noise, and degradation strategies, showing that each contributes to visual quality and temporal consistency. Results indicate that omitting any component leads to measurable drops in aesthetic quality, imaging quality, subject consistency, and background consistency. The full model achieves the highest scores across all metrics, confirming the necessity of the integrated design.

![](https://api-rsrc.hyper.ai/paper2blog/2512.13604/tex_resource/extracted_tables/table-1.png)

The authors evaluate LongVie 2 against several baselines in human evaluations across five perceptual dimensions, including visual quality and temporal consistency. Results show LongVie 2 achieves the highest scores in all categories, outperforming Matrix-Game-2.0, Go-With-The-Flow, DiffusionAsShader, and HunyuanGameCraft. This demonstrates its superior perceptual quality and controllability in long video generation.

![](https://api-rsrc.hyper.ai/paper2blog/2512.13604/tex_resource/extracted_tables/table-2.png)

The authors evaluate LongVie 2 against multiple baselines on LongVGenBench, measuring visual quality, controllability, and temporal consistency. Results show LongVie 2 achieves the highest scores in aesthetic quality, imaging quality, SSIM, and subject consistency, while also leading in background consistency and dynamic degree. These metrics confirm LongVie 2’s superior performance in generating long, controllable, and temporally coherent videos.

![](https://api-rsrc.hyper.ai/paper2blog/2512.13604/tex_resource/extracted_tables/table-3.png)

The authors use a staged training strategy to progressively enhance LongVie 2, with each stage improving visual quality, controllability, and temporal consistency. Results show that adding History Context yields the highest gains across all metrics, particularly in aesthetic quality, imaging quality, and temporal coherence. The final model achieves state-of-the-art performance by integrating multi-modal guidance, degradation-aware training, and history-context alignment.

![](https://api-rsrc.hyper.ai/paper2blog/2512.13604/tex_resource/extracted_tables/table-4.png)

The authors evaluate the impact of degradation strategies on LongVie 2’s performance, showing that adding both encoding and generation degradation improves all metrics: visual quality, controllability, and temporal consistency. Results indicate that combining both degradation types yields the highest scores across aesthetic and imaging quality, SSIM, LPIPS, and all temporal consistency measures. This confirms that degradation-aware training enhances the model’s ability to maintain fidelity and coherence during long video generation.

![](https://api-rsrc.hyper.ai/paper2blog/2512.13604/tex_resource/extracted_tables/table-5.png)
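
As referenced in the inference discussion above, here is a minimal sketch of the global depth-map normalization, assuming the per-frame depth maps of the whole video are stacked into a single NumPy array. The 5th/95th-percentile choice follows the description in the Method section, while the function name and the clipping to $[0, 1]$ are illustrative assumptions.

```python
import numpy as np

def normalize_depth_globally(depth_maps: np.ndarray, lo_pct: float = 5.0,
                             hi_pct: float = 95.0) -> np.ndarray:
    """Normalize per-frame depth maps with video-level percentiles.

    depth_maps : array of shape (num_frames, H, W) covering the *entire* video,
                 so that every clip is scaled consistently rather than per clip.
    """
    lo = np.percentile(depth_maps, lo_pct)   # global 5th percentile
    hi = np.percentile(depth_maps, hi_pct)   # global 95th percentile
    normalized = (depth_maps - lo) / max(hi - lo, 1e-6)
    return np.clip(normalized, 0.0, 1.0)     # assumption: clamp outliers to [0, 1]
```

Because the same `lo`/`hi` values are reused for every clip, the depth control signal keeps a consistent scale across clip boundaries, which is the point of the global normalization; the unified noise initialization plays the analogous role on the latent side by reusing one shared noise tensor for all clips.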
