One month ago

Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

Abstract

As interactive video generation advances, diffusion models are increasingly demonstrating their potential as world models. However, existing approaches still struggle to achieve memory-based long-term temporal consistency and high-resolution real-time generation at the same time, which limits their use in real-world application scenarios. To address this problem, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time long-form video generation. Building on Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency. By modeling prediction residuals and re-injecting imperfectly generated frames during training, the base model learns self-correction. At the same time, camera-aware memory retrieval and injection enable the base model to achieve long-horizon spatiotemporal consistency. Third, to achieve efficient real-time inference, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD) and combine it with model quantization and VAE decoder pruning. Experimental results show that Matrix-Game 3.0 achieves real-time generation at up to 40 FPS at 720p resolution with a 5B model while maintaining stable memory consistency over minute-long sequences. Scaling the model up to 2x14B further improves generation quality, dynamics, and generalization. Our approach offers a practical path toward world models that can be deployed at industrial scale.

One-sentence Summary

The authors propose Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time long-form video generation that utilizes an industrial-scale infinite data engine and a training framework incorporating prediction residual modeling and camera-aware memory retrieval to achieve long-horizon spatiotemporal consistency.

Key Contributions

  • The paper introduces Matrix-Game 3.0, a memory-augmented interactive world model capable of generating 720p high-resolution video in real time.
  • An upgraded industrial-scale infinite data engine is developed to produce high-quality Video-Pose-Action-Prompt quadruplets by integrating Unreal Engine synthetic data, automated AAA game collections, and real-world video augmentation.
  • A novel training framework for long-horizon consistency is proposed that utilizes prediction residual modeling and re-injected generated frames for self-correction alongside camera-aware memory retrieval and injection to maintain spatiotemporal consistency.

Introduction

Interactive world models are essential for simulating complex environments in robotics, gaming, and extended reality by predicting future observations based on user actions. While diffusion models have advanced video synthesis, existing approaches struggle to balance long-term spatiotemporal consistency with the high-resolution, real-time performance required for practical deployment. Current methods often face a trade-off where increasing memory or context length leads to prohibitive latency or a loss of geometric stability.

The authors introduce Matrix-Game 3.0, a framework co-designed across data, modeling, and deployment. They develop an industrial-scale data engine using Unreal Engine 5 and AAA game captures to provide high-quality video-pose-action-prompt quadruplets. To keep long rollouts stable, they implement a camera-aware memory retrieval mechanism and an error-aware training framework that enables the model to learn self-correction. Finally, they use a multi-segment autoregressive distillation strategy combined with model quantization and VAE pruning to achieve 720p generation at up to 40 FPS.

Dataset

Dataset overview

The authors developed a robust data system designed for large-scale world model training by integrating synthetic and real-world data through a unified pipeline.

  • Dataset Composition and Sources: The dataset combines synthetic data (Unreal Engine-based first-person generation and AAA game recordings) with four primary real-world video sources:

    • DL3DV-10K: Over 10,000 4K video sequences across 65 point-of-interest categories.
    • RealEstate10K: Indoor real-estate walkthroughs featuring static scenes and clean camera trajectories.
    • OmniWorld-CityWalk: First-person urban walking footage from YouTube captured under various weather and lighting conditions.
    • SpatialVid-HD: The largest subset, covering high-definition pedestrian, driving, and drone-aerial scenarios to improve long-tail viewpoint coverage.
  • Data Processing and Metadata Construction

    • Uniform Re-annotation: To ensure consistency in coordinate conventions and pose representations, the authors re-annotate all real-world data using ViPE rather than relying on bundled annotations.
    • Hierarchical Textual Annotation: Using InternVL3.5-8B, the authors generate structured descriptions for every clip based on a four-tier schema: narrative captions for holistic summaries, static scene captions for appearance modeling, dense temporal captions for event and motion labels, and perceptual quality scores.
    • Perceptual Quality Scoring: Each clip is rated from 0 to 10 across five dimensions: motion smoothness, background dynamics, scene complexity, physics plausibility, and overall quality.
  • Filtering and Curation: The authors implement a multi-stage filtering process to remove 20% of the raw data and ensure high quality:

    • Trajectory and Speed Filtering: Three criteria are used to eliminate abnormal motion: local geometric consistency (via depth reprojection error), global motion anomaly (via max-to-median displacement ratio), and camera speed filtering (based on median velocity).
    • Quality Filtering: Clips are further vetted using the perceptual quality scores to ensure the final training set is highly curated.
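
The motion and quality filters above can be illustrated with a minimal sketch. It covers only the max-to-median displacement ratio, the median-speed check, and the quality-score cutoff; the depth-reprojection check is omitted because it needs depth maps. The function names, the thresholds (`max_ratio`, `min_speed`, `max_speed`, `min_quality`), and the rejection of fully static cameras are assumptions made for illustration and are not taken from the paper.

```python
import math
from statistics import median

def step_displacements(positions):
    """Euclidean distance between consecutive camera positions."""
    return [math.dist(a, b) for a, b in zip(positions, positions[1:])]

def keep_clip(positions, fps, quality_score,
              max_ratio=10.0, min_speed=0.05, max_speed=5.0, min_quality=6.0):
    """Return True if a clip passes all filters (thresholds are hypothetical)."""
    steps = step_displacements(positions)
    if not steps:
        return False
    med = median(steps)
    if med == 0.0:
        return False  # assumption: reject clips whose camera never moves
    # Global motion anomaly: one jump far larger than the typical step.
    if max(steps) / med > max_ratio:
        return False
    # Camera speed filter based on the median per-second velocity.
    speed = med * fps
    if not (min_speed <= speed <= max_speed):
        return False
    # Perceptual quality filter on the 0-10 overall score.
    return quality_score >= min_quality

# A smooth trajectory, and the same trajectory with one tracking jump inserted.
smooth = [(0.1 * i, 0.0, 0.0) for i in range(10)]
jumpy = smooth[:5] + [(50.0, 0.0, 0.0)] + smooth[5:]
print(keep_clip(smooth, fps=10, quality_score=8.0))
print(keep_clip(jumpy, fps=10, quality_score=8.0))
```

Using the median rather than the mean keeps a single pose-estimation glitch from hiding itself by inflating the reference step size.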

Method

The Matrix-Game 3.0 framework is designed to address the challenges of long-horizon generation and real-time inference in interactive world models. The system integrates four key components: an error-aware interactive base model, a camera-aware long-horizon memory mechanism, a training-inference aligned few-step distillation pipeline, and a real-time inference acceleration module. These components are coordinated to enable stable, high-resolution, and real-time generation with large models.

The core of the framework is the error-aware interactive base model, which is built upon a bidirectional diffusion Transformer. This architecture ensures that the model can maintain consistency during long-term, autoregressive generation while supporting precise action control. The model processes a sequence of video latents, partitioned into past frames that serve as history conditions and current frames to be predicted. Gaussian noise is added to the current frames before they are concatenated with the past frames and fed into the Transformer. The training objective is a flow-matching loss applied only to the current frames. To enable robust action control, discrete keyboard actions are incorporated via a dedicated Cross-Attention module, while continuous mouse-control signals are injected through Self-Attention. The model is also trained with imperfect historical contexts to ensure consistency with the subsequent distillation stage. A critical aspect of this design is the self-correcting formulation, which uses an error buffer to collect and inject residuals, simulating exposure errors during training.
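
As a rough illustration of the training recipe described above, the sketch below shows three pieces in plain Python: the noising of current frames, a loss restricted to current frames, and an error buffer that collects residuals and injects them into history frames. Frames are represented as flat lists of floats. The linear interpolation path `x_t = (1 - t) * x0 + t * noise` is one common flow-matching convention and is assumed here, and the `ErrorBuffer` class and its methods are hypothetical names, not the authors' implementation.

```python
import random

def noisy_sample(x0, noise, t):
    """Linear flow-matching interpolation: x_t = (1 - t) * x0 + t * noise."""
    return [(1 - t) * a + t * n for a, n in zip(x0, noise)]

def velocity_target(x0, noise):
    """Target velocity for the linear path: d x_t / d t = noise - x0."""
    return [n - a for a, n in zip(x0, noise)]

class ErrorBuffer:
    """Stores residuals (prediction - ground truth) observed during training."""

    def __init__(self, capacity=256, seed=0):
        self.capacity = capacity
        self.items = []
        self.rng = random.Random(seed)

    def collect(self, predicted, target):
        self.items.append([p - g for p, g in zip(predicted, target)])
        if len(self.items) > self.capacity:
            self.items.pop(0)  # drop the oldest residual

    def inject(self, clean_frame):
        """Perturb a clean history frame with a sampled residual, if any."""
        if not self.items:
            return list(clean_frame)
        residual = self.rng.choice(self.items)
        return [c + r for c, r in zip(clean_frame, residual)]

def current_frame_loss(pred_velocity, target_velocity, is_current):
    """Mean squared error computed only over frames flagged as current."""
    total, count = 0.0, 0
    for pred, tgt, cur in zip(pred_velocity, target_velocity, is_current):
        if cur:
            total += sum((p - q) ** 2 for p, q in zip(pred, tgt))
            count += len(pred)
    return total / count if count else 0.0
```

Injecting stored residuals into the history makes the conditioning frames look like the model's own imperfect outputs, which is the situation it faces during autoregressive inference.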

Illustration of the interactive base model

To enhance long-horizon generation, the framework incorporates a camera-aware long-horizon memory mechanism. This mechanism is built upon the base model and uses a unified Diffusion Transformer (DiT) to jointly model long-term memory, short-term history, and the current prediction target. Instead of treating memory as a separate branch, retrieved memory latents, past frame latents, and current prediction latents are placed in the same attention space, allowing for direct information exchange. This joint modeling is more compatible with streaming generation than a separate memory pathway. The memory selection is camera-aware, retrieving frames based on camera pose and field-of-view overlap to ensure only view-relevant content is used. The relative geometry between the current target and the selected memory is encoded using Plücker-style cues to help the model reason about scene alignment across different viewpoints. To reduce the train-inference mismatch, the memory pathway also uses error collection and injection on both the retrieved memory and past frames. Additionally, the model's temporal awareness is strengthened by injecting the original frame index into the rotary positional encoding and by introducing a head-wise perturbed RoPE base to mitigate positional aliasing and discourage over-reliance on distant memory.
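
To make the retrieval idea concrete, here is a small sketch under simplifying assumptions. Each camera is a `(position, view_direction)` pair, and the overlap score is a crude proxy that multiplies an angular term by a distance term; the paper's actual field-of-view overlap computation is not specified in this summary. The helper `plucker_ray` uses one common Plücker convention, `(d, o × d)` for a ray with origin `o` and unit direction `d`. All function names and default parameters are hypothetical.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(v):
    return math.sqrt(sum(c * c for c in v))

def plucker_ray(origin, direction):
    """Plücker coordinates (d, o x d) of a ray, with d normalized."""
    n = _norm(direction)
    d = tuple(c / n for c in direction)
    return d + cross(origin, d)

def view_overlap(cam_a, cam_b, fov_deg=90.0, max_dist=10.0):
    """Crude overlap proxy in [0, 1] from viewing-angle gap and distance."""
    (pos_a, dir_a), (pos_b, dir_b) = cam_a, cam_b
    cos_gap = sum(a * b for a, b in zip(dir_a, dir_b)) / (_norm(dir_a) * _norm(dir_b))
    gap_deg = math.degrees(math.acos(max(-1.0, min(1.0, cos_gap))))
    angular = max(0.0, 1.0 - gap_deg / fov_deg)
    spatial = max(0.0, 1.0 - math.dist(pos_a, pos_b) / max_dist)
    return angular * spatial

def retrieve_memory(current, memory, k=2, min_overlap=0.1):
    """Indices of the k stored cameras that best overlap the current view."""
    scored = [(view_overlap(current, cam), idx) for idx, cam in enumerate(memory)]
    scored = [(s, i) for s, i in scored if s >= min_overlap]
    scored.sort(key=lambda item: -item[0])
    return [i for _, i in scored[:k]]

current = ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
memory = [
    ((0.0, 0.0, 0.0), (0.0, 0.0, -1.0)),  # same place, looking the opposite way
    ((1.0, 0.0, 0.0), (0.0, 0.0, 1.0)),   # nearby, same viewing direction
    ((5.0, 0.0, 0.0), (1.0, 0.0, 1.0)),   # farther away, rotated by 45 degrees
]
print(retrieve_memory(current, memory))
```

The frame facing the opposite direction is excluded even though it was captured at the same position, which is the point of making retrieval depend on the view rather than on recency or position alone.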

Illustration of the memory-augmented base model

The training-inference aligned few-step distillation pipeline ensures that the distilled model can perform stable few-step long-horizon generation. This is achieved by training the bidirectional student model to mimic the actual inference process. The student performs multi-segment rollouts, where each segment starts from random noise, and the past frames are taken from the tail of the previous segment. This multi-segment scheme creates a training environment that closely matches the inference behavior, thereby reducing exposure bias. The distillation objective is based on Distribution Matching Distillation (DMD), which minimizes the reverse KL divergence between the student's generated distribution and the data distribution at sampled timesteps. The gradient of this objective is approximated by the difference between the score functions of the data and the generated samples.
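
The multi-segment rollout can be sketched as a simple loop. In this toy version each frame is a single number, `generate_segment` stands in for the few-step student, and `toy_student` is a hypothetical placeholder that just continues a counter so the hand-off between segments is easy to inspect. The DMD loss itself is not modeled here.

```python
import random

def rollout(generate_segment, num_segments, segment_len, history_len, seed=0):
    """Generate segments one after another, as at inference time.

    Each segment starts from fresh Gaussian noise and is conditioned on the
    tail of the previous segment, which the student itself produced.
    """
    rng = random.Random(seed)
    history = []  # no past frames are available before the first segment
    video = []
    for _ in range(num_segments):
        noise = [rng.gauss(0.0, 1.0) for _ in range(segment_len)]
        segment = generate_segment(noise, history)
        video.extend(segment)
        history = segment[-history_len:]  # tail becomes the next history
    return video

def toy_student(noise, history):
    """Placeholder generator: continues a counter from the last history frame."""
    start = history[-1] + 1 if history else 0
    return [start + i for i in range(len(noise))]

print(rollout(toy_student, num_segments=3, segment_len=4, history_len=2))
```

Because the history always comes from generated frames rather than ground truth, the student is trained on the same kind of inputs it will see during streaming inference.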

Illustration of the few-step distillation stage

Finally, the real-time inference acceleration module ensures that the distilled model achieves high-speed inference. This is accomplished through several strategies: INT8 quantization of the DiT model's attention projection layers to reduce computation, VAE pruning to accelerate decoding, and GPU-based memory retrieval. The VAE is pruned to a lightweight version, MG-LightVAE, which achieves significant decoding speedups. The retrieval process is accelerated by using a GPU-based, sampling-based approximation for camera-aware memory retrieval, which is more efficient than the exact CPU-based method for long iterative generation. These optimizations enable the full pipeline to achieve up to 40 FPS inference with a 5B model at 720p resolution.
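
For readers unfamiliar with INT8 quantization, the sketch below shows symmetric per-tensor quantization in plain Python. The summary does not state which quantization scheme the authors use, so the scheme, the function names, and the per-tensor scale are assumptions chosen for simplicity.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map quantized integers back to approximate real values."""
    return [x * scale for x in q]

def int8_dot(qa, scale_a, qb, scale_b):
    """Dot product accumulated in integers, rescaled once at the end."""
    acc = sum(x * y for x, y in zip(qa, qb))
    return acc * scale_a * scale_b

weights = [0.3, -1.2, 0.75]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, [round(r, 3) for r in restored])
```

The integer accumulation in `int8_dot` is where the compute savings come from on hardware with fast INT8 matrix units; the rounding error per element is bounded by half the scale.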

Experiment

The evaluation assesses the interactive base model, its distilled version, and various acceleration strategies to validate long-range scene consistency and inference efficiency. Results show that the memory-augmented base model and its distilled counterpart effectively reconstruct previously visited viewpoints and maintain stable scene layouts during long-horizon generation. Furthermore, combining INT8 quantization, VAE pruning, and GPU-based memory retrieval significantly enhances throughput, with pruned VAE variants successfully balancing reconstruction quality and real-time performance.

The ablation study measures how each acceleration component affects inference speed. Removing any single component lowers the frame rate, and removing GPU retrieval causes the largest drop. INT8 quantization and MG-LightVAE each contribute additional speedup. The full configuration achieves the highest throughput, indicating that the optimizations complement one another.

Inference speed ablation study

The authors compare the reconstruction quality and efficiency of pruned MG-LightVAE variants against the original Wan2.2 VAE. Pruning shortens inference time while keeping reconstruction fidelity acceptable:

  • Pruning reduces inference time for both full and decoder-only reconstruction.
  • Higher pruning ratios yield greater speedup but larger quality degradation.
  • The 50% pruned variant maintains strong reconstruction quality with significant efficiency gains.

VAE pruning efficiency comparison

The study evaluates the impact of various acceleration components and pruning ratios on inference speed and reconstruction quality. Ablation experiments demonstrate that combining GPU retrieval, INT8 quantization, and MG-LightVAE creates a synergistic effect that maximizes throughput. Additionally, pruning the VAE offers a way to significantly reduce inference time, with moderate pruning levels successfully balancing efficiency gains against reconstruction fidelity.

