Command Palette
Search for a command to run...
INSPATIO-WORLD: 시공간적 Autoregressive Modeling을 통한 실시간 4D World Simulator
INSPATIO-WORLD: 시공간적 Autoregressive Modeling을 통한 실시간 4D World Simulator
초록
공간적 일관성(spatial consistency)과 실시간 상호작용성(real-time interactivity)을 갖춘 월드 모델(world models)을 구축하는 것은 컴퓨터 비전 분야의 근본적인 과제로 남아 있습니다. 현재의 비디오 생성 패러다임은 공간적 지속성(spatial persistence)의 결여와 시각적 사실성(visual realism)의 부족으로 인해 복잡한 환경 내에서의 원활한 내비게이션을 지원하는 데 어려움을 겪는 경우가 많습니다.이러한 문제를 해결하기 위해, 본 연구에서는 단일 참조 비디오로부터 고정밀(high-fidelity)의 동적 상호작용 장면을 복원하고 생성할 수 있는 새로운 실시간 프레임워크인 INSPATIO-WORLD를 제안합니다. 우리 접근 방식의 핵심은 Spatiotemporal Autoregressive (STAR) 아키텍처로, 이는 밀접하게 결합된 두 가지 구성 요소를 통해 일관되고 제어 가능한 장면 진화를 가능하게 합니다. 첫째, Implicit Spatiotemporal Cache는 참조 및 과거의 관측 데이터를 잠재적 월드 표현(latent world representation)으로 통합하여 장기적 내비게이션(long-horizon navigation) 과정에서 전역적 일관성을 보장합니다. 둘째, Explicit Spatial Constraint Module은 기하학적 구조를 강제하고 사용자의 상호작용을 정밀하며 물리적으로 타당한 카메라 궤적(camera trajectories)으로 변환합니다.나아가, 본 연구에서는 Joint Distribution Matching Distillation (JDMD)를 도입합니다. JDMD는 실제 데이터 분포를 정규화 가이드(regularizing guide)로 활용함으로써, 합성 데이터에 대한 과도한 의존으로 인해 일반적으로 발생하는 fidelity 저하 문제를 효과적으로 극복합니다. 광범위한 실험을 통해 INSPATIO-WORLD가 공간적 일관성과 상호작용 정밀도 측면에서 기존의 SOTA 모델들을 크게 능가함을 입증하였으며, WorldScore-Dynamic benchmark에서 실시간 상호작용 방식 중 1위를 기록하였습니다. 이는 단안 비디오(monocular videos)로부터 재구성된 4D 환경을 탐색하기 위한 실용적인 pipeline을 구축했음을 의미합니다.
One-sentence Summary
The proposed INSPATIO-WORLD framework functions as a real-time 4D world simulator that generates high-fidelity, dynamic interactive scenes from a single reference video by utilizing a Spatiotemporal Autoregressive architecture composed of an Implicit Spatiotemporal Cache for global consistency, an Explicit Spatial Constraint Module for physically plausible navigation, and Joint Distribution Matching Distillation to maintain visual realism.
Key Contributions
- The paper introduces INSPATIO-WORLD, a real-time 4D generative world model that utilizes a Spatiotemporal Autoregressive (STAR) architecture to enable high-fidelity, interactive scene generation from a single reference video. This framework combines an Implicit Spatiotemporal Cache for long-term global consistency with an Explicit Spatial Constraint Module that translates user interactions into physically plausible camera trajectories.
- A Multi-conditional Causal Initialization strategy is presented to improve multi-condition controllable generation by performing chunk-wise autoregressive multi-step rehearsal on ground-truth data or teacher-model trajectories. This method establishes accurate associations between heterogeneous inputs such as preceding frames, reference images, and geometric constraints during the initial training phase.
- The work proposes Joint Distribution Matching Distillation (JDMD), a dual-teacher paradigm that uses real-world data distributions to decouple and optimize motion fidelity and perceptual realism. Experimental results show that this approach bridges the gap between synthetic and physical domains, achieving state-of-the-art spatial continuity and visual precision at a performance of 24 FPS.
Introduction
Building interactive 4D world models is essential for advancing embodied intelligence and autonomous driving, as it allows for realistic, high-degree-of-freedom navigation within simulated environments. However, current video diffusion models struggle with long-horizon roaming due to spatial persistence degradation, a significant synthetic-to-real gap in visual textures, and imprecise control over user-defined camera trajectories. The authors leverage a Spatiotemporal Autoregressive (STAR) architecture to overcome these bottlenecks, utilizing an implicit spatio-temporal cache for global consistency and explicit spatial constraints for precise geometric reasoning. Additionally, they introduce Joint Distribution Matching Distillation (JDMD), a dual-teacher learning framework that aligns model features with real-world data distributions to ensure high visual fidelity without sacrificing motion controllability.
Method
The authors leverage a spatiotemporal autoregressive framework to enable long-horizon, interactive video generation under multimodal constraints. This framework operates by decomposing the generation process into a sequence of chunks, each consisting of K consecutive frames, and modeling the latent sequence Z1:I as a product of conditional probabilities. The generation of each block zi is guided by three distinct conditions: historical context, reference guidance, and geometric constraints, ensuring both temporal continuity and spatial consistency. As shown in the figure below, the core of the system is the Diffusion Transformer (DiT) block, which receives denoised latent representations conditioned on these inputs to produce the next block of video.

The framework integrates a spatiotemporal cache mechanism to maintain long-term memory efficiently. This mechanism combines short-term historical information, represented by the previously generated latent zi−1, with long-term reference information, ziref, retrieved from a reference video. These are aggregated into an implicit ST-Cache, which provides a stable spatiotemporal anchor for the generation process. To mitigate distribution shifts caused by sequence length growth in the Rotary Position Embedding (RoPE), a position index fixing strategy is employed, anchoring the starting positions of the current block, reference anchor, and historical block to a fixed coordinate origin. This stabilizes the model's representation space and enhances spatial consistency. Additionally, a chunk-wise backpropagation strategy is adopted to address differentiability and memory bottlenecks during training. This strategy decouples the forward inference from backward optimization, allowing for full-link differentiability within each chunk while significantly reducing peak memory usage.
To achieve precise camera control, the system incorporates explicit geometric constraints derived from user interaction instructions. The user's rotation, translation, and perspective shift commands are translated into a 6-DoF relative pose transformation ΔTi, which is recursively accumulated to define the global pose Ti for the current block. Based on this pose, the reference features are geometrically aligned with the current viewpoint using a reprojection operation. This process, illustrated in the figure, involves extracting depth maps and camera intrinsics from the reference video latents via a feed-forward reconstruction method. The resulting warped feature ziwarp and a valid pixel mask mi are concatenated and fed into the DiT block as explicit structural guidance. This mechanism functions as a spatial memory proxy, providing deterministic constraints that prevent scene distortion and ensure multi-view consistency.
The training process employs a joint distribution matching distillation (JDMD) strategy to balance motion compliance and visual fidelity. This approach uses a multi-task learning paradigm with two frozen teacher models: a motion teacher trained on synthetic data to guide precise motion control, and a perceptual teacher derived from a real-world text-to-video foundation model to preserve visual richness. During training, the student model alternates between two distillation tasks: a controllable video rerendering (V2V) task that leverages the synthetic data distribution for motion control, and a text-to-video (T2V) task that aligns with the real-world data distribution for visual fidelity. The overall loss is a weighted sum of the vision distillation loss Lvis and the conditional control loss Lctrl, enabling the model to learn both precise spatio-temporal consistency and high-fidelity visual realism. The implementation details reveal that the framework uses a three-stage training process, starting with teacher model training, followed by student initialization, and culminating in the JDMD distillation phase, with specific learning rates for each stage.
Experiment
The effectiveness of INSPATIO-WORLD is evaluated through the WorldScore benchmark for next-scene generation, long-term image-to-video generation for spatial persistence, and camera-controlled video rerendering for instruction adherence. The results demonstrate that the model achieves state-of-the-art performance by maintaining superior geometric consistency and precise camera control across extended sequences without suffering from structural warping or kinetic drift. Furthermore, the framework provides a highly efficient compute-quality trade-off, delivering high-fidelity visual generation and real-time execution capabilities compared to existing methods.
The authors evaluate INSPATIO-WORLD on the WorldScore benchmark, comparing it against multiple state-of-the-art models. Results show that INSPATIO-WORLD achieves superior performance in camera control and photometric quality while maintaining high computational efficiency. INSPATIO-WORLD achieves the best results in camera control accuracy and translation error compared to all listed methods. The method demonstrates the highest generation quality with the lowest FID and FVD scores among the compared models. INSPATIO-WORLD outperforms other models in both control precision and visual quality metrics.

The authors compare INSPATIO-WORLD against state-of-the-art methods on camera-controlled video rerendering using two datasets. Results show that INSPATIO-WORLD achieves superior performance in generation quality and camera control while maintaining high consistency with the reference video. INSPATIO-WORLD outperforms baselines in video quality metrics on both datasets The method achieves high camera control accuracy with minimal trajectory error It maintains superior consistency with the input reference video compared to existing approaches

The authors evaluate INSPATIO-WORLD on the WorldScore benchmark, comparing it against state-of-the-art models. Results show that INSPATIO-WORLD achieves top performance in camera control and photometric quality while maintaining strong overall dynamic scores, outperforming other methods in key metrics. INSPATIO-WORLD achieves the highest scores in camera control and photometric quality among all methods. The model outperforms others in motion smoothness and 3D consistency, demonstrating strong spatiotemporal generation capabilities. INSPATIO-WORLD ranks first in overall dynamic and static scores, showing superior performance in both interactive and non-interactive settings.

Results show that INSPATIO-WORLD achieves high performance on the WorldScore benchmark with superior computational efficiency. The model outperforms others in dynamic metrics while operating at lower computational costs, indicating a strong trade-off between quality and resource usage. INSPATIO-WORLD achieves top performance in dynamic metrics while requiring significantly lower computational resources. The model outperforms existing methods in motion smoothness, camera control accuracy, and photometric quality. It demonstrates a superior compute-quality trade-off, breaking the traditional zero-sum relationship between geometric control and generation fidelity.

The authors evaluate INSPATIO-WORLD against state-of-the-art models using the WorldScore benchmark and various video rerendering datasets to validate its performance in camera control and visual fidelity. The results demonstrate that the model achieves superior camera control accuracy, photometric quality, and temporal consistency while maintaining high motion smoothness. Furthermore, INSPATIO-WORLD provides an exceptional balance between generation quality and computational efficiency, effectively overcoming the traditional trade-off between geometric control and visual detail.