5달 전

Hanyang Kong Xingyi Yang Xiaoxu Zheng Xinchao Wang

초록

장거리에 걸쳐 기하학적으로 일관성 있는 비디오를 생성하는 것은 근본적인 딜레마를 안고 있다. 일관성은 픽셀 공간에서 3차원 기하학에 엄격히 부합해야 하지만, 최신의 생성 모델들은 카메라 조건이 부여된 잠재 공간에서 가장 효과적으로 작동한다. 이 간극은 기존 방법이 가려진 영역과 복잡한 카메라 경로를 다루는 데 어려움을 겪게 만든다. 이러한 격차를 메우기 위해, 우리는 3차원 구조적 기준점과 2차원 생성 보정기의 결합을 통해 작동하는 WorldWarp 프레임워크를 제안한다. 기하학적 기반을 확보하기 위해 WorldWarp는 가우시안 스플래팅(3DGS)을 통해 실시간으로 구축된 3차원 기하학적 캐시를 유지한다. 과거 콘텐츠를 새로운 시점으로 명시적으로 왜곡함으로써, 이 캐시는 구조적 골격 역할을 하여 각 새 프레임이 이전의 기하학적 구조를 존중하도록 보장한다. 그러나 정적 왜곡은 가림 현상으로 인해 필연적으로 구멍과 아티팩트를 남긴다. 이를 해결하기 위해 우리는 ‘채우고 수정하는( fill-and-revise)’ 목적을 위해 설계된 시공간 확산(ST-Diff) 모델을 활용한다. 우리의 핵심 혁신은 시공간에 따라 변화하는 노이즈 스케줄링이다. 공백 영역에는 전체 노이즈를 부여하여 생성을 유도하고, 왜곡된 영역에는 부분적 노이즈를 제공하여 보정을 가능하게 한다. 매 단계에서 3차원 캐시를 동적으로 업데이트함으로써, WorldWarp는 비디오 청크 간 일관성을 유지한다. 결과적으로, 3차원 논리가 구조를 지도하고 확산 논리가 텍스처를 완성함으로써, 최고 수준의 정확도를 달성한다. 프로젝트 페이지: https://hyokong.github.io/worldwarp-page/

One-sentence Summary

Researchers from National University of Singapore and The Hong Kong Polytechnic University propose WorldWarp, a novel view synthesis framework generating long coherent videos from a single image by coupling 3D Gaussian Splatting (3DGS) for geometric grounding with a Spatio-Temporal Diffusion (ST-Diff) model. Its key innovation is a spatio-temporal varying noise schedule that fills occlusions with full noise while refining warped content with partial noise, maintaining 3D consistency across 200-frame sequences where prior methods fail.

Key Contributions

WorldWarp addresses the fundamental dilemma in long-range video generation where generative models operate in latent space while geometric consistency requires pixel-space 3D adherence, causing failures in occluded areas and complex trajectories by introducing a chunk-based framework that couples an online 3D geometric cache (built via Gaussian Splatting) with a 2D generative refiner to maintain structural grounding.
The framework's core innovation is a Spatio-Temporal Diffusion (ST-Diff) model using a spatio-temporal varying noise schedule that applies full noise to blank regions for generation and partial noise to warped regions for refinement, enabling effective "fill-and-revise" of occlusions while leveraging bidirectional attention conditioned on forward-warped geometric priors.
WorldWarp achieves state-of-the-art geometric consistency and visual fidelity on challenging view extrapolation benchmarks by dynamically updating the 3D cache at each step, preventing irreversible error propagation and demonstrating superior performance over existing methods in long-sequence generation from limited starting images.

Introduction

Novel view synthesis enables applications like virtual reality and immersive telepresence, but generating views far beyond input camera positions—view extrapolation—remains critical for interactive 3D exploration from limited images. Prior methods face significant limitations: camera pose encoding struggles with out-of-distribution trajectories and lacks 3D scene understanding, while explicit 3D spatial priors suffer from occlusions, geometric distortions, and irreversible error propagation during long sequences. The authors address this by introducing WorldWarp, which avoids error accumulation through an autoregressive pipeline using chunk-based generation. Their core innovation leverages a Spatio-Temporal Diffusion model with bidirectional attention, conditioned on forward-warped images from future camera positions as dense 2D priors, alongside an online 3D Gaussian Splatting cache that dynamically refines geometry using only recent high-fidelity outputs. This approach ensures geometric consistency and visual quality over extended camera paths where prior work fails.

Method

The authors leverage a dual-component architecture to achieve long-range, geometrically consistent novel view synthesis: an online 3D geometric cache for structural grounding and a non-causal Spatio-Temporal Diffusion (ST-Diff) model for texture refinement and occlusion filling. The framework operates autoregressively, generating video chunks iteratively while maintaining global 3D consistency through dynamic cache updates.

At the core of the inference pipeline is the online 3D geometric cache, initialized from either the starting image or the previously generated chunk. This cache is constructed by first estimating camera poses and an initial point cloud using TTT3R, followed by optimizing a 3D Gaussian Splatting (3DGS) representation over several hundred steps. This high-fidelity 3DGS model serves as a structural scaffold, enabling accurate forward-warping of the current history into novel viewpoints. Concurrently, a Vision-Language Model (VLM) generates a descriptive text prompt to guide semantic consistency, while novel camera poses for the next chunk are extrapolated using velocity-based linear translation and SLERP rotation interpolation.

Refer to the autoregressive inference pipeline: the forward-warped images and their corresponding validity masks, derived from the 3DGS cache, are encoded into the latent space. The ST-Diff model then initializes the reverse diffusion process from a spatially-varying noise level. For each frame, the noise map is constructed using the latent-space mask: valid (warped) regions are initialized with a reduced noise level $\sigma_{\text{start}}$ controlled by a strength parameter $\tau$ , preserving geometric structure, while occluded (blank) regions are initialized with pure noise ( $\sigma_{\text{filled}} \approx 1.0$ ) to enable generative inpainting. The model $G_{\theta}$ takes this mixed-noise latent sequence, the VLM prompt, and spatially-varying time embeddings as input, denoising the sequence over 50 steps to produce the next chunk of novel views. This newly generated chunk becomes the history for the next iteration, ensuring long-term coherence.

During training, the ST-Diff model is conditioned on a composite latent sequence $\mathcal{Z}_c$ constructed from warped priors and ground-truth latents. The composite is formed by combining valid warped regions from $\mathbf{z}_{ {s\rightarrow t} }$ with blank regions from $\mathbf{z}_t$ , using the downsampled mask $\mathbf{M}_{ {\text{latent},t} }$ :

\mathbf{z}_{c,t} = \mathbf{M}_{ {\text{latent},t} } \odot \mathbf{z}_{ {s\rightarrow t} } + (1 - \mathbf{M}_{ {\text{latent},t} }) \odot \mathbf{z}_t

This composite sequence is then noised according to a spatio-temporally varying schedule: each frame $t$ receives an independently sampled noise level, and within each frame, warped and filled regions receive distinct noise levels $\sigma_{ {\text{warped},t} }$ and $\sigma_{ {\text{filled},t} }$ . The resulting noisy latent sequence is fed into $G_{\theta}$ , which is trained to predict the target velocity $\epsilon_t - \mathbf{z}_t$ , forcing it to learn the flow from the noisy composite back to the ground-truth latent sequence. This training objective explicitly encodes the “fill-and-revise” behavior: generating occluded content from pure noise while refining warped content from a partially noised state.

Experiment

On RealEstate10K, achieved state-of-the-art long-term synthesis (200th frame) with PSNR 17.13 and LPIPS 0.352, surpassing SEVA, VMem, and DFoT while maintaining the lowest pose errors (R_dist 0.697, T_dist 0.203), proving superior mitigation of camera drift.
On DL3DV with complex trajectories, maintained leading performance across all metrics in long-term synthesis (PSNR 14.53, R_dist 1.007), outperforming DFoT (PSNR 13.51) and GenWarp (R_dist 1.351), demonstrating robustness in challenging scenes.
Ablation confirmed 3DGS-based caching is critical for long-range fidelity (PSNR 17.13 vs 9.22 without cache) and spatial-temporal noise diffusion balances generation quality (PSNR 17.13) and camera accuracy (R_dist 0.697), outperforming spatial-only or temporal-only variants.

The authors evaluate ablation variants of their model on the RealEstate10K dataset, comparing caching strategies and noise diffusion designs for short-term (50th frame) and long-term (200th frame) novel view synthesis. Results show that using an online optimized 3DGS cache significantly outperforms RGB point cloud caching and no cache, especially in long-term generation, while the full spatial-temporal noise diffusion design achieves the best balance of visual quality and camera pose accuracy across both time horizons.

The authors evaluate their method on the RealEstate10K dataset, reporting superior performance across both short-term and long-term novel view synthesis metrics. Results show their approach achieves the highest PSNR and lowest LPIPS in both settings, while also maintaining the most accurate camera pose estimates with the smallest rotation and translation distances, particularly outperforming baselines in long-term stability.

The authors measure inference latency across pipeline components, showing that ST-Diff with 50 denoising steps dominates total time at 42.5 seconds, while 3D-related operations (TTT3R, 3DGS optimization, warping) collectively add only 8.5 seconds. Results confirm that geometric conditioning is computationally efficient compared to the generative backbone, which accounts for 78% of the total 54.5-second inference time per video chunk.

The authors evaluate their method on the RealEstate10K dataset, reporting superior performance across both short-term and long-term novel view synthesis metrics. Results show their approach achieves the highest PSNR and SSIM while maintaining the lowest LPIPS, FID, and camera pose errors (R_dist and T_dist), particularly excelling in long-term stability where most baselines degrade significantly. This demonstrates the effectiveness of their spatial-temporal noise diffusion strategy in preserving geometric consistency and mitigating cumulative camera drift.

소스 PDF 코드 보기

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

5달 전

Hanyang Kong Xingyi Yang Xiaoxu Zheng Xinchao Wang

초록

One-sentence Summary

Key Contributions

WorldWarp addresses the fundamental dilemma in long-range video generation where generative models operate in latent space while geometric consistency requires pixel-space 3D adherence, causing failures in occluded areas and complex trajectories by introducing a chunk-based framework that couples an online 3D geometric cache (built via Gaussian Splatting) with a 2D generative refiner to maintain structural grounding.
The framework's core innovation is a Spatio-Temporal Diffusion (ST-Diff) model using a spatio-temporal varying noise schedule that applies full noise to blank regions for generation and partial noise to warped regions for refinement, enabling effective "fill-and-revise" of occlusions while leveraging bidirectional attention conditioned on forward-warped geometric priors.
WorldWarp achieves state-of-the-art geometric consistency and visual fidelity on challenging view extrapolation benchmarks by dynamically updating the 3D cache at each step, preventing irreversible error propagation and demonstrating superior performance over existing methods in long-sequence generation from limited starting images.

Introduction

Method

\mathbf{z}_{c,t} = \mathbf{M}_{ {\text{latent},t} } \odot \mathbf{z}_{ {s\rightarrow t} } + (1 - \mathbf{M}_{ {\text{latent},t} }) \odot \mathbf{z}_t

Experiment

On RealEstate10K, achieved state-of-the-art long-term synthesis (200th frame) with PSNR 17.13 and LPIPS 0.352, surpassing SEVA, VMem, and DFoT while maintaining the lowest pose errors (R_dist 0.697, T_dist 0.203), proving superior mitigation of camera drift.
On DL3DV with complex trajectories, maintained leading performance across all metrics in long-term synthesis (PSNR 14.53, R_dist 1.007), outperforming DFoT (PSNR 13.51) and GenWarp (R_dist 1.351), demonstrating robustness in challenging scenes.
Ablation confirmed 3DGS-based caching is critical for long-range fidelity (PSNR 17.13 vs 9.22 without cache) and spatial-temporal noise diffusion balances generation quality (PSNR 17.13) and camera accuracy (R_dist 0.697), outperforming spatial-only or temporal-only variants.

The authors evaluate their method on the RealEstate10K dataset, reporting superior performance across both short-term and long-term novel view synthesis metrics. Results show their approach achieves the highest PSNR and SSIM while maintaining the lowest LPIPS, FID, and camera pose errors (R_dist and T_dist), particularly excelling in long-term stability where most baselines degrade significantly. This demonstrates the effectiveness of their spatial-temporal noise diffusion strategy in preserving geometric consistency and mitigating cumulative camera drift.

소스 PDF 코드 보기

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

Command Palette

월드워프: 비동기 영상 디퓨전을 통한 3D 기하학의 전파

Hanyang Kong Xingyi Yang Xiaoxu Zheng Xinchao Wang

초록

One-sentence Summary

Key Contributions

Introduction

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

월드워프: 비동기 영상 디퓨전을 통한 3D 기하학의 전파

Hanyang Kong Xingyi Yang Xiaoxu Zheng Xinchao Wang

초록

One-sentence Summary

Key Contributions

Introduction

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

월드워프: 비동기 영상 디퓨전을 통한 3D 기하학의 전파

Hanyang Kong Xingyi Yang Xiaoxu Zheng Xinchao Wang

초록

One-sentence Summary

Key Contributions

Introduction

Method

Experiment

AI로 AI 구축

HyperAI Newsletters