
Matrix-Game 3.0: A Real-Time, Streaming Interactive World Model with Long-Horizon Memory

Abstract

With the progress of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve long-term temporal consistency with memory capabilities and real-time high-resolution generation, which limits their applicability in real-world scenarios. To address this challenge, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for real-time long video generation at 720p. Building on Matrix-Game 2.0, we introduce systematic improvements at the data, model, and inference levels. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to mass-produce high-quality Video-Pose-Action-Prompt data quadruplets. Second, we propose a training framework for long-term consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; in parallel, camera-aware memory retrieval and injection enable the base model to achieve long-horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference.
Experimental results show that Matrix-Game 3.0 achieves real-time generation at up to 40 FPS at 720p resolution with a 5B model, while maintaining stable memory consistency over multi-minute sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach offers a practical path toward industrially deployable world models.

One-sentence Summary

The authors propose Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time long-form video generation that utilizes an industrial-scale infinite data engine and a training framework incorporating prediction residual modeling and camera-aware memory retrieval to achieve long-horizon spatiotemporal consistency.

Key Contributions

  • The paper introduces Matrix-Game 3.0, a memory-augmented interactive world model capable of generating 720p high-resolution video in real time.
  • An upgraded industrial-scale infinite data engine is developed to produce high-quality Video-Pose-Action-Prompt quadruplets by integrating Unreal Engine synthetic data, automated AAA game collections, and real-world video augmentation.
  • A novel training framework for long-horizon consistency is proposed that utilizes prediction residual modeling and re-injected generated frames for self-correction alongside camera-aware memory retrieval and injection to maintain spatiotemporal consistency.

Introduction

Interactive world models are essential for simulating complex environments in robotics, gaming, and extended reality by predicting future observations based on user actions. While diffusion models have advanced video synthesis, existing approaches struggle to balance long-term spatiotemporal consistency with the high-resolution, real-time performance required for practical deployment. Current methods often face a trade-off where increasing memory or context length leads to prohibitive latency or a loss of geometric stability.

The authors leverage a co-designed framework across data, modeling, and deployment to introduce Matrix-Game 3.0. They develop an industrial-scale data engine using Unreal Engine 5 and AAA game captures to provide high-quality video-pose-action-prompt quadruplets. To ensure stability, the authors implement a camera-aware memory retrieval mechanism and an error-aware training framework that enables the model to learn self-correction. Finally, they utilize a multi-segment autoregressive distillation strategy combined with model quantization and VAE pruning to achieve 720p generation at up to 40 FPS.

Dataset

Dataset overview

The authors developed a robust data system designed for large-scale world model training by integrating synthetic and real-world data through a unified pipeline.

  • Dataset Composition and Sources: The dataset combines synthetic data from Unreal Engine-based first-person generation and AAA game recordings with four primary real-world video sources:

    • DL3DV-10K: Over 10,000 4K video sequences across 65 point-of-interest categories.
    • RealEstate10K: Indoor real-estate walkthroughs featuring static scenes and clean camera trajectories.
    • OmniWorld-CityWalk: First-person urban walking footage from YouTube captured under various weather and lighting conditions.
    • SpatialVid-HD: The largest subset, covering high-definition pedestrian, driving, and drone-aerial scenarios to improve long-tail viewpoint coverage.
  • Data Processing and Metadata Construction

    • Uniform Re-annotation: To ensure consistency in coordinate conventions and pose representations, the authors re-annotate all real-world data using ViPE rather than relying on bundled annotations.
    • Hierarchical Textual Annotation: Using InternVL3.5-8B, the authors generate structured descriptions for every clip based on a four-tier schema: narrative captions for holistic summaries, static scene captions for appearance modeling, dense temporal captions for event and motion labels, and perceptual quality scores.
    • Perceptual Quality Scoring: Each clip is rated from 0 to 10 across five dimensions: motion smoothness, background dynamics, scene complexity, physics plausibility, and overall quality.
  • Filtering and Curation: The authors implement a multi-stage filtering process that removes 20% of the raw data to ensure high quality:

    • Trajectory and Speed Filtering: Three criteria are used to eliminate abnormal motion: local geometric consistency (via depth reprojection error), global motion anomaly (via max-to-median displacement ratio), and camera speed filtering (based on median velocity).
    • Quality Filtering: Clips are further vetted using the perceptual quality scores to ensure the final training set is highly curated.
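The trajectory and speed criteria above can be sketched in code. The sketch below implements two of them, the max-to-median displacement ratio and the median camera-speed check; the thresholds and the function name are illustrative assumptions, not values from the paper.

```python
import numpy as np

def motion_filters(positions, fps, max_median_ratio=8.0, speed_range=(0.1, 10.0)):
    """Return True if a clip passes two of the motion criteria described
    above. Thresholds are illustrative, not taken from the paper."""
    positions = np.asarray(positions, dtype=np.float64)   # (T, 3) camera positions
    disp = np.linalg.norm(np.diff(positions, axis=0), axis=1)  # per-frame displacement
    if disp.size == 0 or np.median(disp) == 0:
        return False
    # Global motion anomaly: a single huge jump relative to typical motion
    ratio_ok = disp.max() / np.median(disp) <= max_median_ratio
    # Camera speed filter: median velocity must fall in a plausible range
    median_speed = np.median(disp) * fps
    speed_ok = speed_range[0] <= median_speed <= speed_range[1]
    return bool(ratio_ok and speed_ok)
```

The local geometric-consistency check (depth reprojection error) needs per-frame depth and intrinsics and is omitted here.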

Method

The Matrix-Game 3.0 framework is designed to address the challenges of long-horizon generation and real-time inference in interactive world models. The system integrates four key components: an error-aware interactive base model, a camera-aware long-horizon memory mechanism, a training-inference aligned few-step distillation pipeline, and a real-time inference acceleration module. These components are coordinated to enable stable, high-resolution, and real-time generation with large models.

The core of the framework is the error-aware interactive base model, which is built upon a bidirectional diffusion Transformer. This architecture ensures that the model can maintain consistency during long-term, autoregressive generation while supporting precise action control. The model processes a sequence of video latents, partitioned into past frames that serve as history conditions and current frames to be predicted. Gaussian noise is added to the current frames before they are concatenated with the past frames and fed into the Transformer. The training objective is a flow-matching loss applied only to the current frames. To enable robust action control, discrete keyboard actions are incorporated via a dedicated Cross-Attention module, while continuous mouse-control signals are injected through Self-Attention. The model is also trained with imperfect historical contexts to ensure consistency with the subsequent distillation stage. A critical aspect of this design is the self-correcting formulation, which uses an error buffer to collect and inject residuals, simulating exposure errors during training.
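The conditioning scheme above (clean past frames, noised current frames, loss on current frames only) can be sketched as a single schematic training step. This is a minimal illustration assuming a linear flow-matching path; the function and variable names are ours, and the real model operates on video latents with action conditioning rather than plain arrays.

```python
import numpy as np

rng = np.random.default_rng(0)

def training_step(latents, n_past, t, model):
    """One schematic step: noise only the current frames, condition on the
    past frames, and compute a flow-matching loss on the current frames."""
    past, current = latents[:n_past], latents[n_past:]
    noise = rng.standard_normal(current.shape)
    # Linear interpolation path assumed here: x_t = (1 - t) * x0 + t * noise
    noised = (1.0 - t) * current + t * noise
    target_velocity = noise - current             # velocity target for this path
    pred = model(np.concatenate([past, noised]))  # full sequence in, prediction out
    # Loss is applied only at the current-frame positions
    return float(np.mean((pred[n_past:] - target_velocity) ** 2))
```

In the paper's error-aware setup, the `past` frames would additionally be perturbed with residuals drawn from the error buffer instead of being used clean.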

Illustration of the interactive base model

To enhance long-horizon generation, the framework incorporates a camera-aware long-horizon memory mechanism. This mechanism is built upon the base model and uses a unified Diffusion Transformer (DiT) to jointly model long-term memory, short-term history, and the current prediction target. Instead of treating memory as a separate branch, retrieved memory latents, past frame latents, and current prediction latents are placed in the same attention space, allowing for direct information exchange. This joint modeling is more compatible with streaming generation than a separate memory pathway. The memory selection is camera-aware, retrieving frames based on camera pose and field-of-view overlap to ensure only view-relevant content is used. The relative geometry between the current target and the selected memory is encoded using Plücker-style cues to help the model reason about scene alignment across different viewpoints. To reduce the train-inference mismatch, the memory pathway also uses error collection and injection on both the retrieved memory and past frames. Additionally, the model's temporal awareness is strengthened by injecting the original frame index into the rotary positional encoding and by introducing a head-wise perturbed RoPE base to mitigate positional aliasing and discourage over-reliance on distant memory.
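Camera-aware retrieval can be illustrated with a simple pose-based scorer. The sketch below uses viewing-direction cosine similarity as a proxy for field-of-view overlap and breaks ties by camera distance; the exact overlap metric, data layout, and function name are assumptions, since the summary does not specify them.

```python
import numpy as np

def retrieve_memory(current_pose, memory_poses, fov_deg=90.0, top_k=4):
    """Schematic camera-aware retrieval: keep stored frames whose viewing
    direction falls inside the current field of view (cosine proxy for
    FOV overlap), preferring nearby viewpoints among them."""
    cur_dir = current_pose["dir"] / np.linalg.norm(current_pose["dir"])
    cos_half_fov = np.cos(np.radians(fov_deg / 2))
    scored = []
    for i, pose in enumerate(memory_poses):
        d = pose["dir"] / np.linalg.norm(pose["dir"])
        cos_sim = float(cur_dir @ d)
        if cos_sim >= cos_half_fov:               # view directions overlap
            dist = float(np.linalg.norm(current_pose["pos"] - pose["pos"]))
            scored.append((cos_sim - 0.01 * dist, i))
    scored.sort(reverse=True)
    return [i for _, i in scored[:top_k]]
```

The retrieved indices would then select memory latents that are placed alongside past and current latents in the same attention space, with Plücker-style embeddings encoding the relative geometry.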

Illustration of the memory-augmented base model

The training-inference aligned few-step distillation pipeline ensures that the distilled model can perform stable few-step long-horizon generation. This is achieved by training the bidirectional student model to mimic the actual inference process. The student performs multi-segment rollouts, where each segment starts from random noise, and the past frames are taken from the tail of the previous segment. This multi-segment scheme creates a training environment that closely matches the inference behavior, thereby reducing exposure bias. The distillation objective is based on Distribution Matching Distillation (DMD), which minimizes the reverse KL divergence between the student's generated distribution and the data distribution at sampled timesteps. The gradient of this objective is approximated by the difference between the score functions of the data and the generated samples.
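The DMD objective and its score-difference gradient can be written out as follows. The notation below is our assumption based on the standard DMD formulation, since the summary does not give the exact symbols:

```latex
% Reverse-KL distribution matching over sampled timesteps t
\min_{\theta} \; \mathbb{E}_{t}\, D_{\mathrm{KL}}\!\left( p_{\theta,t} \,\|\, p_{\mathrm{data},t} \right)

% Gradient approximated by the difference of score functions, where
% x_t is a noised student sample G_\theta(z), and s_fake, s_real are
% the scores of the generated and data distributions:
\nabla_{\theta} \approx \mathbb{E}_{t,\,z}\!\left[
    \big( s_{\mathrm{fake}}(x_t, t) - s_{\mathrm{real}}(x_t, t) \big)
    \frac{\partial G_\theta(z)}{\partial \theta}
\right]
```

In the multi-segment rollout scheme, $z$ is fresh noise per segment and the past-frame conditioning for each segment comes from the tail of the student's own previous segment, which is what aligns training with streaming inference.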

Illustration of the few-step distillation stage

Finally, the real-time inference acceleration module ensures that the distilled model achieves high-speed inference. This is accomplished through several strategies: INT8 quantization of the DiT model's attention projection layers to reduce computation, VAE pruning to accelerate decoding, and GPU-based memory retrieval. The VAE is pruned to a lightweight version, MG-LightVAE, which achieves significant decoding speedups. The retrieval process is accelerated by using a GPU-based, sampling-based approximation for camera-aware memory retrieval, which is more efficient than the exact CPU-based method for long iterative generation. These optimizations enable the full pipeline to achieve up to 40 FPS inference with a 5B model at 720p resolution.
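The INT8 quantization of the attention projection layers can be sketched as symmetric per-tensor quantization. This is a minimal illustration of the idea only; the paper's actual scheme (per-channel scales, calibration, fused INT8 kernels) is not detailed in the summary, and the function names here are ours.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization of a weight matrix."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(x, q, scale):
    """Dequantize-on-the-fly matmul; real kernels keep a fused INT8 path."""
    return (x @ q.astype(np.float32)) * scale
```

A production deployment would instead run the accumulation in INT8/INT32 on tensor cores and only rescale the output, which is where the actual speedup comes from.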

Experiment

The evaluation assesses the interactive base model, its distilled version, and various acceleration strategies to validate long-range scene consistency and inference efficiency. Results show that the memory-augmented base model and its distilled counterpart effectively reconstruct previously visited viewpoints and maintain stable scene layouts during long-horizon generation. Furthermore, combining INT8 quantization, VAE pruning, and GPU-based memory retrieval significantly enhances throughput, with pruned VAE variants successfully balancing reconstruction quality and real-time performance.

The study evaluates the impact of each acceleration component on inference speed. Ablations show that removing any individual component reduces frames per second, with GPU-based memory retrieval causing the largest drop when removed. INT8 quantization and MG-LightVAE both contribute to improved inference efficiency, and the full configuration achieves the highest throughput, indicating synergistic benefits from the combined optimizations.

Inference speed ablation study

The authors compare the reconstruction quality and efficiency of pruned variants of MG-LightVAE against the original Wan2.2 VAE. Pruning reduces inference time for both full and decoder-only reconstruction while maintaining acceptable fidelity; higher pruning ratios yield greater speedup at the cost of larger quality degradation. The 50% pruned variant maintains strong reconstruction quality with significant efficiency gains.
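One common way to obtain such pruned variants is magnitude-based channel pruning. The sketch below drops the output channels of a layer with the smallest L1 norm; the paper's actual MG-LightVAE pruning recipe is not described in the summary, so this only illustrates the general technique, with names of our choosing.

```python
import numpy as np

def prune_channels(weight, ratio=0.5):
    """Drop the `ratio` fraction of output channels with the smallest
    L1 norm; returns the pruned weight and the kept channel indices."""
    norms = np.abs(weight).reshape(weight.shape[0], -1).sum(axis=1)
    keep = np.sort(np.argsort(norms)[int(len(norms) * ratio):])
    return weight[keep], keep
```

After pruning, downstream layers must be sliced to match the kept channels, and the lightweight VAE is typically fine-tuned briefly to recover reconstruction quality.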

VAE pruning efficiency comparison

The study evaluates the impact of various acceleration components and pruning ratios on inference speed and reconstruction quality. Ablation experiments demonstrate that combining GPU retrieval, INT8 quantization, and MG-LightVAE creates a synergistic effect that maximizes throughput. Additionally, pruning the VAE offers a way to significantly reduce inference time, with moderate pruning levels successfully balancing efficiency gains against reconstruction fidelity.

