minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models
Abstract
Video diffusion foundation models have recently made remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains challenging. Interactive world models require controllable, causal, low-latency rollout, which in practice calls for a complete pipeline spanning data construction, controllable fine-tuning, autoregressive training, few-step distillation, and streaming inference. In this work, we present minWM, a full-stack open-source framework for building real-time interactive video world models. minWM provides an end-to-end pipeline for converting existing bidirectional T2V/TI2V video foundation models into camera-controllable, few-step autoregressive world models. Specifically, minWM first fine-tunes a bidirectional video diffusion model with camera control, then applies the Causal Forcing / Causal Forcing++ pipeline, which comprises AR diffusion training, causal ODE or causal consistency distillation, and asymmetric DMD, to distill it into a few-step autoregressive generator for low-latency rollout. The framework is modular and architecturally extensible: we instantiate it on representative open backbones, including Wan2.1-T2V-1.3B and HY1.5-TI2V-8B, covering both cross-attention-based condition injection and MMDiT-style architectures. minWM also supports adapting existing video world models, such as HY-WorldPlay, to new data distributions, new training recipes, and specific latency targets.
Beyond releasing runnable scripts, checkpoints, documentation, and inference code, we provide practical ablation studies on camera trajectory quality, controllability training steps, and minimum batch-size requirements. We hope minWM will serve as a reproducible and extensible recipe for building and adapting real-time interactive video world models. Project page: https://github.com/shengshu-ai/minWM
One-sentence Summary
minWM is a modular full-stack open-source framework that converts bidirectional video diffusion models into camera-controllable, few-step autoregressive world models via a pipeline combining camera-conditioned fine-tuning, causal forcing, causal consistency distillation, and asymmetric DMD to enable low-latency interactive generation on architectures such as Wan2.1-T2V-1.3B and HY1.5-TI2V-8B.
Key Contributions
- This work introduces minWM, a full-stack open-source framework that provides a modular, end-to-end pipeline for converting existing bidirectional text-to-video and text-and-image-to-video foundation models into camera-controllable autoregressive world models. The framework unifies data construction, controllable fine-tuning, autoregressive training, few-step distillation, and streaming inference into a reproducible workflow.
- The framework executes a two-phase conversion recipe that first fine-tunes a bidirectional diffusion backbone on camera-annotated data to enable trajectory control. It subsequently applies Causal Forcing or Causal Forcing++ pipelines, combining autoregressive diffusion training, causal ODE or consistency distillation, and asymmetric DMD post-training to distill the model into a few-step autoregressive generator for low-latency rollout.
- The pipeline is instantiated on Wan2.1-T2V-1.3B and HY1.5-TI2V-8B backbones to demonstrate real-time interactive video generation across cross-attention and MMDiT architectures. By releasing intermediate checkpoints for each training stage and providing ablation studies on camera trajectory quality and training configurations, the framework offers actionable guidance for reproducible world model development.
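The conversion recipe in the contributions above can be pictured as an ordered list of stages, where each stage starts from the checkpoint produced by the previous one. The sketch below is purely illustrative: the `Stage` class, the stage names, and the `plan` helper are hypothetical and are not taken from the minWM codebase.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    name: str   # hypothetical label, not an identifier from minWM
    phase: int  # 1 = bidirectional fine-tuning, 2 = AR distillation


# Stage order as described in the text; the names are illustrative only.
PIPELINE: List[Stage] = [
    Stage("camera_finetune_bidirectional", 1),
    Stage("ar_diffusion_training", 2),
    Stage("causal_ode_or_consistency_distillation", 2),
    Stage("asymmetric_dmd", 2),
]


def plan(resume_from: Optional[str] = None) -> List[str]:
    """Return the stages still to run.

    If `resume_from` names a stage whose checkpoint is already available,
    only the stages after it are returned.
    """
    names = [stage.name for stage in PIPELINE]
    if resume_from is None:
        return names
    if resume_from not in names:
        raise ValueError(f"unknown stage: {resume_from}")
    return names[names.index(resume_from) + 1:]
```

Under this reading, starting from a released intermediate checkpoint amounts to calling `plan` with the name of the stage that produced it, so only the remaining stages are run.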
Introduction
High-quality diffusion-based video foundation models have significantly advanced visual generation, yet they function as offline generators rather than interactive world models. Real-time interactive applications demand causal rollout, responsive camera control, and low-latency frame synthesis, but existing conversion techniques remain scattered across disconnected pipelines that require extensive manual effort across data preparation, fine-tuning, autoregressive training, and distillation. To bridge this gap, the authors introduce minWM, a full-stack open-source framework that unifies the entire workflow into a single reproducible pipeline. The authors leverage a two-phase strategy that first fine-tunes bidirectional video backbones for camera controllability and then applies causal forcing alongside asymmetric distillation to convert them into few-step autoregressive generators. This modular architecture enables researchers to seamlessly adapt existing foundation models into real-time, camera-controlled video world models while supporting mid-pipeline checkpointing and customizable training configurations.
Method
The authors present minWM, a full-stack framework for constructing real-time interactive video world models, which operates through a two-phase pipeline. The overall architecture begins with data preparation, proceeds through a training phase that includes bidirectional diffusion fine-tuning and autoregressive distillation, and culminates in low-latency inference. The process starts with inputs comprising an image, text, and an action signal, which are processed through data filtering and rebalancing, followed by structured annotation to generate a dataset suitable for training. This dataset is then used to fine-tune a bidirectional diffusion model with camera controllability, leveraging PROPE as the injection method for camera parameters. The framework supports multiple backbones, including Wan2.1-T2V-1.3B and HY1.5-TI2V-8B, enabling the conversion of existing video foundation models into camera-controllable few-step autoregressive generators.
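Camera-conditioned training needs a camera trajectory for every clip. The sketch below is not PROPE and not minWM's annotation code; it only illustrates one common preprocessing step, expressing each frame's pose relative to the first frame so that trajectories from different clips share a reference frame. It assumes poses are 4x4 camera-to-world rigid transforms stored as nested lists, and all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def invert_rigid(pose: Matrix) -> Matrix:
    """Invert a 4x4 rigid transform [R | t; 0 0 0 1] as [R^T | -R^T t]."""
    r_t = [[pose[j][i] for j in range(3)] for i in range(3)]
    t = [pose[i][3] for i in range(3)]
    new_t = [-sum(r_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain 4x4 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def relative_to_first(poses: List[Matrix]) -> List[Matrix]:
    """Express every camera-to-world pose in the first camera's frame."""
    first_inv = invert_rigid(poses[0])
    return [matmul(first_inv, pose) for pose in poses]
```

After this normalization the first pose is always the identity, so the model only ever sees motion relative to the starting view.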
As shown in the figure below, training is divided into two phases. Phase 1, bidirectional diffusion training, fine-tunes the bidirectional model with camera control injected via PROPE. This phase equips the model to condition on camera trajectories while keeping the original self-attention structure. Phase 2, AR diffusion distillation, applies the Causal Forcing or Causal Forcing++ pipeline to transform the bidirectional model into a real-time interactive autoregressive model. It comprises AR diffusion training, initialization by causal ODE or causal consistency distillation, and asymmetric DMD post-training. At inference, streaming VAE decoding and prompt engineering are used to produce real-time interactive video output. The framework is modular and extensible, supporting the adaptation of existing models to new data distributions and latency targets.
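The text does not spell out the attention pattern used after conversion. A common way to make a video diffusion transformer autoregressive is frame-wise causal attention, in which a token attends to its own frame and to earlier frames only. The following minimal sketch contrasts that pattern with full bidirectional attention; it is an assumption about the general technique, not minWM's implementation, and the function names are hypothetical.

```python
from typing import List


def block_causal_mask(num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """mask[q][k] is True when query token q may attend to key token k.

    Tokens in the same frame see each other; across frames, a token only
    sees frames at or before its own.
    """
    n = num_frames * tokens_per_frame
    frame_of = [i // tokens_per_frame for i in range(n)]
    return [[frame_of[k] <= frame_of[q] for k in range(n)] for q in range(n)]


def bidirectional_mask(num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """Full attention: every token sees every other token."""
    n = num_frames * tokens_per_frame
    return [[True] * n for _ in range(n)]
```

Because no token depends on a later frame under the block-causal mask, frames can be generated and displayed one after another, which is the property a streaming rollout relies on.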
Experiment
The evaluation trains two base video generation models using an autoregressive framework with few-step distillation to assess inference efficiency and camera control. Qualitative results demonstrate that the approach significantly reduces first-frame latency, enabling seamless playback during generation, while successfully preserving camera-controllable capabilities. Ablation studies further reveal that robust camera control depends on high-quality ground-truth trajectory data, sufficient training iterations, and a minimum batch size to ensure stable optimization.
The authors compare first-frame latency and speedup for multi-step bidirectional and few-step autoregressive (AR) models built on two different base models. For both base models, the few-step AR model substantially reduces first-frame latency and achieves a substantial speedup over the multi-step bidirectional baseline, while camera-controllable generation is preserved despite the latency improvements.
The evaluation compares the initial generation latency and processing speed of multi-step bidirectional models against few-step autoregressive architectures across two distinct base models. Results demonstrate that the few-step autoregressive approach substantially accelerates first-frame generation and overall inference efficiency. Importantly, these performance improvements are achieved while successfully preserving the models' camera-controllable generation capabilities.
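To see why a few-step autoregressive model can show its first frame sooner, consider a back-of-the-envelope cost model. A bidirectional model must run every denoising step over the whole clip before any frame is available, whereas an autoregressive model only needs a few steps over the first chunk. The sketch below assumes cost grows linearly with the number of frames processed per step, which is a simplification, and every number in the usage lines is made up for illustration; none of them is a result reported for minWM.

```python
def first_frame_latency(steps: int, frames_per_pass: int,
                        seconds_per_frame_step: float) -> float:
    """Time until the first frame can be shown.

    `steps` denoising steps must finish over `frames_per_pass` frames,
    with an assumed linear cost of `seconds_per_frame_step` per frame per step.
    """
    return steps * frames_per_pass * seconds_per_frame_step


def speedup(baseline_latency: float, new_latency: float) -> float:
    """Ratio of baseline latency to new latency (higher is better)."""
    return baseline_latency / new_latency


# Hypothetical settings, NOT measurements from the paper.
bidirectional = first_frame_latency(steps=50, frames_per_pass=80,
                                    seconds_per_frame_step=0.25)
autoregressive = first_frame_latency(steps=4, frames_per_pass=4,
                                     seconds_per_frame_step=0.25)
ratio = speedup(bidirectional, autoregressive)
```

In this toy model the gain comes from two independent factors, fewer denoising steps and fewer frames processed before the first frame is ready, which is why distillation and causal generation are combined.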