Matrix-Game 3.0: A Real-Time, Streaming Interactive World Model with Long-Horizon Memory

Abstract

With the advance of interactive video generation, diffusion models have increasingly shown great potential as world models. However, existing methods still struggle to simultaneously achieve memory-capable long-term spatiotemporal consistency and high-resolution real-time generation, which limits their applicability in real-world scenarios. To address this, we introduce Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time long video generation. Building on Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and augmented real-world videos, enabling large-scale production of high-quality video-pose-action-prompt quadruplets. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model acquires self-correction capability, while a camera-aware memory retrieval and injection mechanism enables long-horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, for efficient real-time inference. Experiments show that Matrix-Game 3.0 achieves real-time generation of up to 40 FPS at 720p with a 5B-parameter model while maintaining stable memory consistency over sequences lasting several minutes. Scaling to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach offers a practical path toward industrially deployable world models.

One-sentence Summary

The authors propose Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time long-form video generation that utilizes an industrial-scale infinite data engine and a training framework incorporating prediction residual modeling and camera-aware memory retrieval to achieve long-horizon spatiotemporal consistency.

Key Contributions

  • The paper introduces Matrix-Game 3.0, a memory-augmented interactive world model capable of generating 720p high-resolution video in real time.
  • An upgraded industrial-scale infinite data engine is developed to produce high-quality Video-Pose-Action-Prompt quadruplets by integrating Unreal Engine synthetic data, automated AAA game collections, and real-world video augmentation.
  • A novel training framework for long-horizon consistency is proposed that utilizes prediction residual modeling and re-injected generated frames for self-correction alongside camera-aware memory retrieval and injection to maintain spatiotemporal consistency.

Introduction

Interactive world models are essential for simulating complex environments in robotics, gaming, and extended reality by predicting future observations based on user actions. While diffusion models have advanced video synthesis, existing approaches struggle to balance long-term spatiotemporal consistency with the high-resolution, real-time performance required for practical deployment. Current methods often face a trade-off where increasing memory or context length leads to prohibitive latency or a loss of geometric stability.

The authors leverage a co-designed framework across data, modeling, and deployment to introduce Matrix-Game 3.0. They develop an industrial-scale data engine using Unreal Engine 5 and AAA game captures to provide high-quality video-pose-action-prompt quadruplets. To ensure stability, the authors implement a camera-aware memory retrieval mechanism and an error-aware training framework that enables the model to learn self-correction. Finally, they utilize a multi-segment autoregressive distillation strategy combined with model quantization and VAE pruning to achieve 720p generation at up to 40 FPS.

Dataset

Dataset overview

The authors developed a robust data system designed for large-scale world model training by integrating synthetic and real-world data through a unified pipeline.

  • Dataset Composition and Sources: The dataset combines synthetic data from Unreal Engine-based first-person generation and AAA game recordings with four primary real-world video sources:

    • DL3DV-10K: Over 10,000 4K video sequences across 65 point-of-interest categories.
    • RealEstate10K: Indoor real-estate walkthroughs featuring static scenes and clean camera trajectories.
    • OmniWorld-CityWalk: First-person urban walking footage from YouTube captured under various weather and lighting conditions.
    • SpatialVid-HD: The largest subset, covering high-definition pedestrian, driving, and drone-aerial scenarios to improve long-tail viewpoint coverage.
  • Data Processing and Metadata Construction

    • Uniform Re-annotation: To ensure consistency in coordinate conventions and pose representations, the authors re-annotate all real-world data using ViPE rather than relying on bundled annotations.
    • Hierarchical Textual Annotation: Using InternVL3.5-8B, the authors generate structured descriptions for every clip based on a four-tier schema: narrative captions for holistic summaries, static scene captions for appearance modeling, dense temporal captions for event and motion labels, and perceptual quality scores.
    • Perceptual Quality Scoring: Each clip is rated from 0 to 10 across five dimensions: motion smoothness, background dynamics, scene complexity, physics plausibility, and overall quality.
  • Filtering and Curation: The authors implement a multi-stage filtering process that removes 20% of the raw data to ensure high quality:

    • Trajectory and Speed Filtering: Three criteria are used to eliminate abnormal motion: local geometric consistency (via depth reprojection error), global motion anomaly (via max-to-median displacement ratio), and camera speed filtering (based on median velocity).
    • Quality Filtering: Clips are further vetted using the perceptual quality scores to ensure the final training set is highly curated.
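Two of the three trajectory criteria above (the max-to-median displacement ratio and the median-velocity speed filter) can be sketched as simple checks on an estimated camera trajectory; the depth-reprojection criterion is omitted here since it requires dense depth. The thresholds and function names below are illustrative assumptions, not the authors' exact values.

```python
import numpy as np

def trajectory_filters(positions, fps=30.0,
                       max_ratio_thresh=8.0,      # assumed anomaly threshold
                       speed_range=(0.1, 10.0)):  # assumed m/s bounds
    """Flag a clip as abnormal based on its estimated camera trajectory.

    positions: (N, 3) array of per-frame camera positions.
    Returns (keep, diagnostics).
    """
    # Per-frame displacement magnitudes along the trajectory.
    disp = np.linalg.norm(np.diff(positions, axis=0), axis=1)

    # Global motion anomaly: max-to-median displacement ratio.
    med = np.median(disp)
    max_to_median = disp.max() / max(med, 1e-8)

    # Camera speed filter: median velocity in metres per second.
    median_speed = med * fps

    keep = (max_to_median <= max_ratio_thresh and
            speed_range[0] <= median_speed <= speed_range[1])
    return keep, {"max_to_median": max_to_median, "median_speed": median_speed}

# A smooth forward walk at ~1.5 m/s passes; a single teleport-like spike fails.
smooth = np.stack([np.linspace(0, 5, 101), np.zeros(101), np.zeros(101)], axis=1)
jumpy = smooth.copy()
jumpy[50] += np.array([20.0, 0.0, 0.0])
```

In practice such filters run on poses re-annotated with ViPE, so both criteria operate in a consistent metric coordinate frame across all real-world sources.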

Method

The Matrix-Game 3.0 framework is designed to address the challenges of long-horizon generation and real-time inference in interactive world models. The system integrates four key components: an error-aware interactive base model, a camera-aware long-horizon memory mechanism, a training-inference aligned few-step distillation pipeline, and a real-time inference acceleration module. These components are coordinated to enable stable, high-resolution, and real-time generation with large models.

The core of the framework is the error-aware interactive base model, which is built upon a bidirectional diffusion Transformer. This architecture ensures that the model can maintain consistency during long-term, autoregressive generation while supporting precise action control. The model processes a sequence of video latents, partitioned into past frames that serve as history conditions and current frames to be predicted. Gaussian noise is added to the current frames before they are concatenated with the past frames and fed into the Transformer. The training objective is a flow-matching loss applied only to the current frames. To enable robust action control, discrete keyboard actions are incorporated via a dedicated Cross-Attention module, while continuous mouse-control signals are injected through Self-Attention. The model is also trained with imperfect historical contexts to ensure consistency with the subsequent distillation stage. A critical aspect of this design is the self-correcting formulation, which uses an error buffer to collect and inject residuals, simulating exposure errors during training.
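The training step described above can be sketched in a few lines: history latents are occasionally corrupted with residuals from an error buffer, the current frames are noised under a rectified-flow interpolation, and the loss is computed only on the current-frame positions. This is a minimal sketch under assumed conventions (the interpolation schedule, injection probability, and the `model` stand-in are illustrative, not the paper's exact design).

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_step(past_latents, current_latents, model, error_buffer,
                       p_inject=0.5):
    """One illustrative training step of the error-aware base model.

    past_latents:    (P, D) clean history latents.
    current_latents: (C, D) latents the model must predict.
    model(x, t):     stand-in for the diffusion Transformer's velocity head.
    error_buffer:    list of residual arrays collected from imperfect rollouts.
    """
    # Simulate exposure error: with some probability, replace the clean
    # history with imperfect generated frames via a stored residual.
    if error_buffer and rng.random() < p_inject:
        past_latents = past_latents + error_buffer[rng.integers(len(error_buffer))]

    # Rectified-flow interpolation between Gaussian noise and data
    # (one common convention; the paper's schedule may differ).
    t = rng.random()
    noise = rng.standard_normal(current_latents.shape)
    x_t = (1.0 - t) * noise + t * current_latents
    target_v = current_latents - noise  # velocity target

    # History and noised current frames share one token sequence.
    tokens = np.concatenate([past_latents, x_t], axis=0)
    pred_v = model(tokens, t)[len(past_latents):]  # loss only on current frames

    return np.mean((pred_v - target_v) ** 2)
```

The key point the sketch illustrates is the slicing in the last lines: past frames condition the prediction through attention but contribute nothing to the flow-matching loss.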

Illustration of the interactive base model

To enhance long-horizon generation, the framework incorporates a camera-aware long-horizon memory mechanism. This mechanism is built upon the base model and uses a unified Diffusion Transformer (DiT) to jointly model long-term memory, short-term history, and the current prediction target. Instead of treating memory as a separate branch, retrieved memory latents, past frame latents, and current prediction latents are placed in the same attention space, allowing for direct information exchange. This joint modeling is more compatible with streaming generation than a separate memory pathway. The memory selection is camera-aware, retrieving frames based on camera pose and field-of-view overlap to ensure only view-relevant content is used. The relative geometry between the current target and the selected memory is encoded using Plücker-style cues to help the model reason about scene alignment across different viewpoints. To reduce the train-inference mismatch, the memory pathway also uses error collection and injection on both the retrieved memory and past frames. Additionally, the model's temporal awareness is strengthened by injecting the original frame index into the rotary positional encoding and by introducing a head-wise perturbed RoPE base to mitigate positional aliasing and discourage over-reliance on distant memory.
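The camera-aware selection step can be sketched as a scoring problem: stored frames are ranked by how well their view direction and position overlap the current camera's frustum, and only the top-scoring frames are retrieved. The scoring function below is an illustrative proxy (cosine gating by half the field of view plus a distance penalty), not the paper's exact retrieval rule.

```python
import numpy as np

def retrieve_memory(cur_pos, cur_dir, mem_pos, mem_dir, fov_deg=90.0, top_k=4):
    """Camera-aware retrieval sketch: rank stored frames by view overlap.

    cur_pos, cur_dir: (3,) current camera position and unit view direction.
    mem_pos, mem_dir: (M, 3) stored camera positions and unit view directions.
    Returns indices of the top_k most view-relevant memory frames.
    """
    # Angular agreement of view directions, gated by half the field of view
    # as a crude frustum-overlap proxy.
    cos_sim = mem_dir @ cur_dir
    visible = cos_sim >= np.cos(np.radians(fov_deg / 2.0))

    # Among frames looking the same way, prefer nearby cameras.
    dist = np.linalg.norm(mem_pos - cur_pos, axis=1)
    score = np.where(visible, cos_sim - 0.01 * dist, -np.inf)

    order = np.argsort(-score)[:top_k]
    return order[score[order] > -np.inf]  # drop non-overlapping frames
```

In the full model the retrieved latents would then be placed alongside past and current latents in the shared attention space, with Plücker-style relative-geometry cues encoding how each memory view relates to the current target.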

Illustration of the memory-augmented base model

The training-inference aligned few-step distillation pipeline ensures that the distilled model can perform stable few-step long-horizon generation. This is achieved by training the bidirectional student model to mimic the actual inference process. The student performs multi-segment rollouts, where each segment starts from random noise, and the past frames are taken from the tail of the previous segment. This multi-segment scheme creates a training environment that closely matches the inference behavior, thereby reducing exposure bias. The distillation objective is based on Distribution Matching Distillation (DMD), which minimizes the reverse KL divergence between the student's generated distribution and the data distribution at sampled timesteps. The gradient of this objective is approximated by the difference between the score functions of the data and the generated samples.
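A standard statement of the DMD gradient referenced above, following the Distribution Matching Distillation literature (with $w_t$ a timestep weighting; the paper's exact weighting may differ), is:

```latex
\nabla_\theta \, D_{\mathrm{KL}}\big(p_{\mathrm{gen}} \,\|\, p_{\mathrm{data}}\big)
\;\approx\;
\mathbb{E}_{z,\,t}\!\left[\, w_t \,
\big( s_{\mathrm{gen}}(x_t, t) - s_{\mathrm{data}}(x_t, t) \big)\,
\frac{\partial G_\theta(z)}{\partial \theta} \,\right],
```

where $G_\theta$ is the few-step student generator, $x_t$ is a noised version of its output $G_\theta(z)$ at timestep $t$, and $s_{\mathrm{data}}$, $s_{\mathrm{gen}}$ are score functions estimated by the frozen teacher and an auxiliary diffusion model trained on the student's samples. This matches the description in the text: the KL gradient is approximated by the difference between the two scores evaluated on the student's noised outputs.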

Illustration of the few-step distillation stage

Finally, the real-time inference acceleration module ensures that the distilled model achieves high-speed inference. This is accomplished through several strategies: INT8 quantization of the DiT model's attention projection layers to reduce computation, VAE pruning to accelerate decoding, and GPU-based memory retrieval. The VAE is pruned to a lightweight version, MG-LightVAE, which achieves significant decoding speedups. The retrieval process is accelerated by using a GPU-based, sampling-based approximation for camera-aware memory retrieval, which is more efficient than the exact CPU-based method for long iterative generation. These optimizations enable the full pipeline to achieve up to 40 FPS inference with a 5B model at 720p resolution.
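The quantization component can be illustrated with a symmetric per-output-channel INT8 scheme for a projection weight, which is one common choice for attention projection layers; the specific scheme, granularity, and calibration used in Matrix-Game 3.0 are not detailed here, so treat this as a representative sketch.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-output-channel INT8 quantization of a projection weight.

    w: (out_features, in_features) float32 weight matrix.
    Returns (q, scale) with q int8 and scale of shape (out_features, 1).
    """
    # One scale per output channel, mapping the channel's max |w| to 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-12)  # guard all-zero channels
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximate float weight for comparison.
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((8, 16)).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
```

Per-channel scaling keeps the rounding error bounded by half a quantization step in each output channel, which is why attention projections usually tolerate INT8 with little quality loss.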

Experiment

The evaluation assesses the interactive base model, its distilled version, and various acceleration strategies to validate long-range scene consistency and inference efficiency. Results show that the memory-augmented base model and its distilled counterpart effectively reconstruct previously visited viewpoints and maintain stable scene layouts during long-horizon generation. Furthermore, combining INT8 quantization, VAE pruning, and GPU-based memory retrieval significantly enhances throughput, with pruned VAE variants successfully balancing reconstruction quality and real-time performance.

The study evaluates the impact of different acceleration components on inference speed. Removing any individual component reduces frames per second, with GPU-based retrieval having the most significant effect: disabling it causes the largest drop. INT8 quantization and MG-LightVAE both contribute to improved inference efficiency, and the full configuration achieves the highest throughput, indicating synergistic benefits from the combined optimizations.

Inference speed ablation study

The authors compare the reconstruction quality and efficiency of pruned variants of MG-LightVAE against the original Wan2.2 VAE. Pruning reduces inference time for both full and decoder-only reconstruction, and higher pruning ratios yield greater speedups at the cost of larger quality degradation. The 50% pruned variant maintains strong reconstruction quality while delivering significant efficiency gains.

VAE pruning efficiency comparison

The study evaluates the impact of various acceleration components and pruning ratios on inference speed and reconstruction quality. Ablation experiments demonstrate that combining GPU retrieval, INT8 quantization, and MG-LightVAE creates a synergistic effect that maximizes throughput. Additionally, pruning the VAE offers a way to significantly reduce inference time, with moderate pruning levels successfully balancing efficiency gains against reconstruction fidelity.

