HyperAIHyperAI

Command Palette

Search for a command to run...

스테레오월드: 기하학 인지 단안에서 스테레오 비디오 생성으로

초록

XR 장치의 점진적인 보급은 고품질 스테레오 영상에 대한 수요를 크게 증가시켰지만, 그 제작 과정은 여전히 비용이 높고 아티팩트 발생에 취약한 문제가 존재한다. 이 문제를 해결하기 위해 우리는 사전 학습된 영상 생성 모델을 활용하여 단안 영상에서 스테레오 영상으로의 고정밀 변환을 수행하는 엔드투엔드 프레임워크인 StereoWorld를 제안한다. 본 프레임워크는 단안 영상 입력을 공동 조건으로 설정하면서, 3D 구조적 정확성을 보장하기 위해 기하학적 인식을 반영한 정규화 항목을 명시적으로 도입하여 생성 과정을 감독한다. 또한 효율적인 고해상도 합성을 가능하게 하기 위해 공간-시간 틀(tile) 구조를 도입하였다. 대규모 학습 및 평가를 가능하게 하기 위해, 자연스러운 인간의 안구거리(IPD, interpupillary distance)에 맞춰 정렬된 1,100만 프레임 이상을 포함하는 고해상도 스테레오 영상 데이터셋을 구축하였다. 광범위한 실험을 통해 StereoWorld가 기존 방법들을 크게 능가하며, 시각적 정밀도와 기하학적 일관성이 뛰어난 스테레오 영상을 생성함을 입증하였다. 프로젝트 웹페이지는 다음 주소에서 확인할 수 있다: https://ke-xing.github.io/StereoWorld/.

One-sentence Summary

Researchers from Beijing Jiaotong University, Dzine AI, and the University of Toronto propose StereoWorld, an end-to-end framework that converts monocular videos into high-fidelity stereo videos using a pretrained generator with geometry-aware regularization and spatio-temporal tiling, enabling scalable, temporally consistent 3D content generation for XR applications.

Key Contributions

  • StereoWorld addresses the challenge of generating high-fidelity stereo video from monocular footage by introducing an end-to-end diffusion-based framework that repurposes a pretrained video generator, jointly conditioning on input video and enforcing geometric accuracy through disparity and depth supervision.
  • The method incorporates a spatio-temporal tiling strategy to enable efficient, high-resolution stereo video synthesis while maintaining temporal coherence and 3D structural fidelity, overcoming limitations of multi-stage depth-warping approaches that suffer from artifacts and misalignment.
  • To support training and evaluation, the authors curate StereoWorld-11M, the first large-scale, publicly available stereo video dataset with over 11 million frames aligned to natural human interpupillary distance (IPD), and demonstrate state-of-the-art performance with significant improvements in both objective metrics and subjective visual quality.

Introduction

The authors leverage the growing demand for immersive stereo content driven by Extended Reality (XR) devices, where high-quality stereo video is essential but costly to produce due to reliance on specialized dual-camera systems. Prior methods for converting monocular videos to stereo either depend on error-prone 3D scene reconstruction or use a multi-stage depth-warping-inpainting pipeline that disrupts spatial-temporal coherence, leading to artifacts and poor geometric consistency. To overcome these limitations, the authors introduce StereoWorld, an end-to-end diffusion-based framework that adapts a pretrained monocular video generator to directly synthesize high-fidelity stereo videos with explicit geometry-aware supervision through joint RGB and depth modeling. Their approach ensures cross-view alignment and temporal stability while enabling high-resolution generation via a spatio-temporal tiling strategy. A key enabler of this work is the creation of StereoWorld-11M, the first large-scale, high-definition stereo video dataset aligned with human interpupillary distance, which supports both training and realistic evaluation for XR applications.

Dataset

  • The authors use a newly curated dataset called StereoWorld-11M, designed specifically for high-fidelity stereo video generation aligned with human visual perception.
  • The dataset is built from over 100 high-definition Blu-ray side-by-side (SBS) stereo movies collected online, covering diverse genres including animation, realism, war, sci-fi, historical, and drama to ensure visual variety.
  • All videos are standardized into a 1080p resolution, 16:9 aspect ratio, and 24 fps frame rate using SBS format, with left and right views extracted via horizontal cropping and stretching.
  • To meet model input requirements, each video is downscaled to 480p resolution, and 81 frames are uniformly sampled at fixed intervals per clip to boost motion diversity and temporal density.
  • The processed dataset serves as the primary training source for the base model, providing stereo-aligned video clips optimized for natural viewing comfort with baseline distances matching typical human inter-pupillary distance (IPD).

Method

The authors leverage a pretrained text-to-video diffusion model based on the Diffusion Transformer (DiT) architecture as the foundation for their stereo video synthesis framework. The model operates in a latent space derived from a 3D Variational Auto-Encoder (3D VAE), which encodes input videos into compact representations. The DiT backbone integrates self-attention and cross-attention modules to jointly model spatio-temporal dynamics and cross-view interactions. Training follows the Rectified Flow framework, where the forward process defines a linear trajectory between data and noise: zt=(1t)z0+tϵz_t = (1 - t) z_0 + t \epsilonzt=(1t)z0+tϵ, with ϵN(0,I)\epsilon \sim \mathcal{N}(0, I)ϵN(0,I). The model learns a velocity field vΘ(zt,t)v_{\Theta}(z_t, t)vΘ(zt,t) that maps noise z1z_1z1 back to data z0z_0z0 via an ODE, optimized using Conditional Flow Matching (CFM) loss to regress the target vector field ut(z0ϵ)u_t(z_0|\epsilon)ut(z0ϵ).

To adapt this monocular generator for stereo synthesis, the authors introduce a conditioning strategy that concatenates left-view and right-view latents along the frame dimension, enabling the model to naturally fuse cross-view information through its existing 3D attention mechanisms. This avoids architectural modifications and preserves computational efficiency. Refer to the framework diagram for a visual overview of the training and inference pipelines.

To enforce geometric consistency, the authors introduce a geometry-aware regularization composed of disparity and depth supervision. During training, ground-truth disparity maps are precomputed using a stereo matching network and used to supervise a lightweight differentiable stereo projector that estimates disparity from the generated right-view latent and the input left-view latent. The disparity loss combines a log-scale global consistency term and an L1 pixel-wise error term: Ldis=Llog+λL1LL1\mathcal{L}_{\mathrm{dis}} = \mathcal{L}_{\log} + \lambda_{\mathrm{L1}} \mathcal{L}_{\mathrm{L1}}Ldis=Llog+λL1LL1. This addresses misalignment and temporal drift between views.

Since disparity supervision is limited to overlapping regions, the authors supplement it with depth supervision. Precomputed depth maps for the right view are encoded into latent space and used to train a separate depth velocity field via the same CFM objective. To avoid gradient conflict between RGB and depth tasks, the final DiT blocks are duplicated into two specialized branches—one for RGB and one for depth—while earlier blocks remain shared to capture joint representations. The overall training objective combines RGB reconstruction, depth consistency, and disparity loss: L=Lrgb+Ldep+λdisLdis\mathcal{L} = \mathcal{L}_{\mathrm{rgb}} + \mathcal{L}_{\mathrm{dep}} + \lambda_{\mathrm{dis}} \mathcal{L}_{\mathrm{dis}}L=Lrgb+Ldep+λdisLdis.

For long video generation, a temporal tiling strategy is employed during training: with probability ppp, the first nnn noisy frames are replaced with clean ground-truth frames to stabilize learning. During inference, long sequences are split into overlapping segments, with the last frames of the prior segment guiding the next to maintain temporal coherence.

For high-resolution inference, a spatial tiling strategy is applied: the input video is encoded into latents, split into overlapping tiles, denoised independently, and then stitched back together with fused overlaps before decoding. This enables scalable generation without memory constraints.

The inference pipeline uses only the shared and RGB DiT blocks, taking a monocular video as input and producing a geometrically consistent right-view video, ready for stereoscopic applications such as anaglyph or XR device rendering.

Experiment

  • Introduced temporal tiling with probabilistic clean frame replacement during training to improve long-range temporal consistency and reduce flickering in generated videos.
  • Applied spatial tiling via block-wise latent diffusion to enable high-resolution video synthesis beyond the 480p training resolution while preserving spatial detail and coherence.
  • Achieved state-of-the-art performance on quantitative metrics: on the test set, obtained 25.98 PSNR, 0.796 SSIM, 0.095 LPIPS, 0.502 IQ-Score, and 0.970 TF-Score, outperforming GenStereo, SVG, and StereoCrafter.
  • Demonstrated superior geometric accuracy with 17.45 EPE and 0.421 D1-all, indicating more accurate disparity estimation and stronger stereo correspondence than baselines.
  • Ablation studies confirmed the importance of depth and disparity supervision, with full model achieving best results (25.98 PSNR, 0.796 SSIM, 17.45 EPE, 0.421 D1-all).
  • Human evaluation with 20 participants showed significant improvements across all perceptual metrics: Stereo Effect (4.8), Visual Quality (4.7), Binocular Consistency (4.9), and Temporal Consistency (4.8), validating enhanced 3D immersion and visual fidelity.

The authors evaluate their method against three baselines using quantitative metrics for visual fidelity and geometric accuracy. Results show their approach achieves the highest PSNR, SSIM, IQ-Score, and lowest LPIPS, EPE, and D1-all, indicating superior image quality and stereo correspondence. Their method also matches or exceeds baselines in temporal stability, as reflected in TF-Score.

The authors conduct a human evaluation across four perceptual dimensions and find that their method achieves the highest scores in all categories: Stereo Effect, Visual Quality, Binocular Consistency, and Temporal Consistency. Results indicate that participants perceived their generated stereo videos as more immersive, visually clear, spatially aligned, and temporally stable compared to all baselines.

The authors evaluate the impact of depth and disparity supervision in their model, showing that combining both components yields the best performance across all metrics. Results indicate that disparity supervision alone improves geometric accuracy more than depth supervision, but the full model with both signals achieves the highest PSNR, SSIM, and lowest LPIPS, EPE, and D1-all scores. This demonstrates that the two supervision signals are complementary and essential for generating geometrically accurate and visually faithful stereo videos.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp