2 months ago

Jisu Nam Yicong Hong Chun-Hao Paul Huang Feng Liu JoungBin Lee Jiyoung Kim Siyoon Jin Yunsung Lee Jaeyoon Jung Suhwan Choi

Abstract

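Recent advances in video diffusion transformers have enabled interactive gaming world models that let users explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior works treat user actions only as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world: actions induce relative camera motions, which accumulate into a global camera pose within the 3D world. In this paper, we establish camera pose as a unifying geometric representation that jointly grounds immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation. To support this research, we build a large-scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that the proposed approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.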

One-sentence Summary

Researchers from KAIST, Adobe Research, and MAUM AI introduce WorldCam, a foundation model that unifies precise action control and long-horizon 3D consistency by mapping user inputs to Lie algebra-based camera poses, outperforming prior methods in interactive gaming scenarios through a novel pose-indexed memory retrieval system.

Key Contributions

- The paper introduces a physics-based continuous action space that translates user inputs into precise 6-DoF camera poses using Lie algebra, which are then injected into a video diffusion transformer via a camera embedder to ensure accurate action alignment.
- A retrieval mechanism is presented that uses global camera poses as spatial indices to fetch relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation.
- The authors release a large-scale dataset containing 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions to support the training and evaluation of interactive gaming world models.

Introduction

Interactive gaming world models built on video diffusion transformers aim to generate playable environments, yet they struggle with precise action control and maintaining 3D consistency over long horizons. Prior approaches often treat user inputs as abstract signals or rely on simplified linear approximations, which fail to capture the complex geometric coupling between actions and camera motion in a 3D space. The authors introduce WorldCam, a framework that establishes camera pose as a unifying geometric representation to simultaneously ground immediate action control and long-term spatial consistency. They achieve this by translating user inputs into precise 6-DoF poses using Lie algebra and leveraging these poses to retrieve past observations for geometrically coherent revisiting of locations. Additionally, the team addresses data scarcity by releasing WorldCam-50h, a large-scale dataset of authentic human gameplay annotated with camera trajectories and text descriptions.

Dataset

Dataset Composition and Sources: The authors introduce WorldCam-50h, a large-scale dataset of human gameplay videos designed to capture authentic action dynamics. Data is sourced from three games: Counter-Strike (closed-licensed), and Xonotic and Unvanquished (open-licensed under CC BY-SA 2.5 and GPL v3). The collection focuses on single-player exploration within static environments to ensure reproducibility and visual diversity.
Key Details for Each Subset: The dataset comprises over 100 videos per game, with each video averaging 8 minutes to yield approximately 17 hours of footage per title. Participants were instructed to perform diverse behaviors such as navigation, rapid camera movements, and revisiting locations. The total collection amounts to roughly 50 hours of gameplay.
Model Usage and Training Strategy: The authors utilize the entire dataset for training foundational gaming world models. Unlike prior works that discard textual guidance, this approach leverages detailed captions to maintain frame quality and scene style during the training process.
Processing and Metadata Construction:
- Captioning: Each training video chunk is annotated with detailed textual descriptions generated by Qwen2.5-VL-7B. These prompts focus on global layout, visual themes, and ambient environmental conditions.
- Camera Annotation: Global camera pose information, including intrinsics and extrinsics, is extracted for every one-minute segment using ViPE.
- Filtering: To ensure data quality, the authors apply a filtering step that removes camera pose estimates with unrealistically large translation magnitudes.

Method

The authors propose WorldCam, an interactive 3D world model designed to autoregressively generate video sequences that accurately follow user actions while maintaining long-term spatial consistency. The system takes an initial RGB observation, a text prompt, and a sequence of user actions as input to generate future frames.

Overall, the architecture integrates three components: action-to-camera mapping, camera-controlled generation, and a pose-anchored memory mechanism, as shown in the paper's framework diagram.

The core generative backbone is a pretrained Video Diffusion Transformer (DiT), specifically Wan-2.1-T2V. Given an input video $V$, a VAE encoder maps it to a latent sequence $\mathbf{z}_0$. The DiT learns to predict the velocity field that transports noisy latents $\mathbf{z}_t$ toward the clean latents $\mathbf{z}_0$ using a flow matching objective:

$$
L_{\mathrm{FM}} = \mathbb{E}_{\mathbf{z}_0, c_{\mathrm{text}}, t} \Big[ \big\| v_{\theta}(\mathbf{z}_t, c_{\mathrm{text}}, t) - \frac{\mathbf{z}_0 - \mathbf{z}_t}{1 - t} \big\|_2^2 \Big].
$$
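The regression target in this objective can be checked numerically. The sketch below is a toy NumPy illustration, not the authors' training code: it assumes the linear interpolation path $\mathbf{z}_t = t\,\mathbf{z}_0 + (1 - t)\,\boldsymbol{\epsilon}$ (so $t = 1$ is clean data), which the text does not state explicitly, and it omits the text condition. The function name `flow_matching_loss` and the tensor shapes are hypothetical.

```python
import numpy as np

def flow_matching_loss(v_pred, z0, zt, t):
    """Mean squared error between a predicted velocity and (z0 - zt) / (1 - t)."""
    target = (z0 - zt) / (1.0 - t)  # undefined at t = 1, so sample t < 1
    return float(np.mean(np.sum((v_pred - target) ** 2, axis=-1)))

rng = np.random.default_rng(0)
z0 = rng.normal(size=(4, 16))   # clean latents (toy shape)
eps = rng.normal(size=(4, 16))  # Gaussian noise
t = 0.3

# ASSUMPTION: linear path with t = 0 pure noise and t = 1 clean data.
zt = t * z0 + (1.0 - t) * eps

# Under that path the target simplifies to the constant velocity z0 - eps.
target = (z0 - zt) / (1.0 - t)
perfect_loss = flow_matching_loss(z0 - eps, z0, zt, t)
```

Under the assumed path, the target reduces to $\mathbf{z}_0 - \boldsymbol{\epsilon}$, so a model that predicts this constant velocity attains zero loss.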

To ensure precise control over camera motion, the authors define the action space in the Lie algebra $\mathfrak{se}(3)$. User actions are represented as twist vectors $A_i = [\mathbf{v}_i; \boldsymbol{\omega}_i] \in \mathbb{R}^6$, containing linear and angular velocities. These are converted into relative camera poses $\Delta P_i \in SE(3)$ via the matrix exponential map:

$$
\Delta P_i = \exp(\hat{A}_i) = \left[ \begin{array}{ll} \Delta R_i & \Delta t_i \\ \mathbf{0}^\top & 1 \end{array} \right],
$$

where $\hat{A}_i$ is the $4 \times 4$ matrix representation of the twist. This formulation jointly integrates linear and angular velocities on the $SE(3)$ manifold, avoiding the geometric inconsistencies found in decoupled linear approximations.
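To make the mapping concrete, here is a minimal NumPy sketch of the $SE(3)$ exponential map using the standard closed form (Rodrigues' formula for the rotation and the matrix usually written $V$ for the translation). It is an illustration under the twist ordering $[\mathbf{v}; \boldsymbol{\omega}]$ given above, not the authors' implementation; the names `hat3` and `se3_exp` are hypothetical.

```python
import numpy as np

def hat3(w):
    """Skew-symmetric matrix such that hat3(w) @ x equals np.cross(w, x)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(twist):
    """Map a twist [v; omega] in R^6 to a 4x4 relative pose in SE(3)."""
    v = np.asarray(twist[:3], dtype=float)
    w = np.asarray(twist[3:], dtype=float)
    theta = np.linalg.norm(w)
    W = hat3(w)
    if theta < 1e-8:
        # Small-angle limit: truncated Taylor expansions of R and V.
        R = np.eye(3) + W + 0.5 * W @ W
        V = np.eye(3) + 0.5 * W + W @ W / 6.0
    else:
        a = np.sin(theta) / theta
        b = (1.0 - np.cos(theta)) / theta**2
        c = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + a * W + b * W @ W  # Rodrigues' rotation formula
        V = np.eye(3) + b * W + c * W @ W  # couples rotation into translation
    P = np.eye(4)
    P[:3, :3] = R
    P[:3, 3] = V @ v
    return P

# Moving along +x while yawing 90 degrees about z within one step.
delta_P = se3_exp([1.0, 0.0, 0.0, 0.0, 0.0, np.pi / 2])
```

In this example the translation comes out as $(2/\pi,\, 2/\pi,\, 0)$, a circular arc, whereas treating translation and rotation separately would give $(1, 0, 0)$. This is the kind of discrepancy the joint integration avoids.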

The derived camera poses are then used to condition the generative model. The poses are converted into Plücker embeddings $\hat{P} \in \mathbb{R}^{F \times 6}$ to provide explicit view-dependent geometric information. A lightweight camera embedding module $c_{\phi}$, consisting of two MLP layers, processes these embeddings. To align with the temporally compressed latent sequence, $r$ consecutive Plücker embeddings are concatenated for each latent frame, giving the per-latent-frame embedding $\hat{\mathbf{p}}$. The resulting camera embeddings are added to the DiT features $\mathbf{d}$ after each self-attention layer:

$$
\mathbf{d} \gets \mathbf{d} + c_{\phi}(\hat{\mathbf{p}}).
$$
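As an illustration of the conditioning signal, the sketch below computes Plücker coordinates $(\mathbf{o} \times \mathbf{d}, \mathbf{d})$ for the pixel rays of one camera and then groups $r$ consecutive frames along the channel axis. Several details are assumptions not given in the text: the per-pixel layout, the pinhole intrinsics `K`, the camera-to-world pose convention, and the value of `r`. The function names are hypothetical, and the two-layer MLP $c_{\phi}$ is omitted.

```python
import numpy as np

def plucker_rays(K, cam_to_world, height, width):
    """Per-pixel Plücker coordinates (o x d, d) for one camera, shape (H, W, 6)."""
    ys, xs = np.meshgrid(np.arange(height) + 0.5, np.arange(width) + 0.5, indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # homogeneous pixel centers
    dirs_cam = pix @ np.linalg.inv(K).T                  # back-project to camera-frame rays
    R, o = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs = dirs_cam @ R.T                                # rotate rays into the world frame
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)
    moment = np.cross(o, dirs)                           # o x d, broadcast over pixels
    return np.concatenate([moment, dirs], axis=-1)

def group_for_latents(emb, r):
    """Concatenate r consecutive frames on channels: (F, H, W, 6) -> (F // r, H, W, 6 * r)."""
    F, H, W, C = emb.shape
    if F % r != 0:
        raise ValueError("number of frames must be divisible by r")
    grouped = emb.reshape(F // r, r, H, W, C).transpose(0, 2, 3, 1, 4)
    return grouped.reshape(F // r, H, W, r * C)

# Toy usage with hypothetical intrinsics and a translated camera.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [1.0, 2.0, 3.0]
emb = plucker_rays(K, pose, height=48, width=64)
per_latent = group_for_latents(np.stack([emb] * 8), r=4)
```

A useful sanity check on any Plücker implementation is that the moment is orthogonal to the ray direction for every pixel.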

To maintain 3D consistency over long horizons, the system employs a pose-anchored long-term memory pool $\mathcal{M}$. This pool stores previously generated latents along with their global camera poses. The global pose $P_i^{\mathrm{global}}$ is computed by accumulating relative poses. During generation, a hierarchical retrieval strategy is used to find relevant context. First, the system selects the top-$K$ candidates based on translation distance to the current position. From these, it further selects $L$ entries whose viewing directions are most aligned with the current orientation, measured by the trace of the relative rotation matrix. These retrieved latents are concatenated with the current input sequence, and their associated poses are realigned and injected into the DiT to enforce spatial coherence.
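The two-stage retrieval described above can be sketched as follows. This is a minimal illustration that stores only poses, not latents. The right-multiplication convention for accumulating poses, the function names, and the toy helper `yaw_pose` are assumptions made for the example.

```python
import numpy as np

def yaw_pose(x, yaw_deg):
    """Toy helper: 4x4 pose at position (x, 0, 0), rotated yaw_deg about the z-axis."""
    a = np.deg2rad(yaw_deg)
    P = np.eye(4)
    P[:2, :2] = [[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]]
    P[0, 3] = x
    return P

def accumulate_global_poses(rel_poses):
    """Compose relative poses into global poses, starting from the identity."""
    poses, current = [], np.eye(4)
    for dP in rel_poses:
        current = current @ dP  # ASSUMPTION: relative poses are right-multiplied
        poses.append(current.copy())
    return poses

def retrieve(memory_poses, query_pose, k, l):
    """Indices of l entries: the k nearest in translation, then the best aligned in rotation."""
    mem = np.asarray(memory_poses)
    dist = np.linalg.norm(mem[:, :3, 3] - query_pose[:3, 3], axis=1)
    near = np.argsort(dist)[:k]
    # trace(R_q^T R_m) = 1 + 2 cos(angle): a larger trace means a smaller relative rotation.
    traces = np.array([np.trace(query_pose[:3, :3].T @ mem[i, :3, :3]) for i in near])
    best = np.argsort(-traces)[:l]
    return near[best].tolist()

memory = [yaw_pose(0.0, 0), yaw_pose(0.1, 180), yaw_pose(0.2, 10), yaw_pose(10.0, 0)]
selected = retrieve(memory, yaw_pose(0.0, 0), k=3, l=2)
```

Here `selected` contains entries 0 and 2: entry 1 is nearby but faces the opposite way, and entry 3 faces the same way but is too far to survive the distance filter.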

Finally, the model utilizes a progressive autoregressive inference strategy. A progressive per-frame noise schedule assigns monotonically increasing noise levels to latent frames within each denoising window. This provides a low-noise anchor in early frames while keeping future frames at higher noise levels for correction. During inference, the latent sequence is shifted forward after completing all denoising stages, with the earliest frame evicted and a new pure-noise latent appended. An attention sink mechanism is also incorporated to stabilize attention and preserve frame fidelity during long rollouts.
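The sliding-window behavior can be sketched in a few lines. The linear spacing of noise levels, the convention that a level of 1.0 means pure noise, and the names `progressive_noise_levels` and `shift_window` are assumptions for illustration only; the text specifies only that levels increase monotonically across the window and that the earliest frame is evicted while a pure-noise latent is appended.

```python
import numpy as np

def progressive_noise_levels(num_frames):
    """Monotonically increasing per-frame noise levels in (0, 1]; 1.0 means pure noise."""
    # ASSUMPTION: linear spacing; the source only requires monotonic increase.
    return np.arange(1, num_frames + 1) / num_frames

def shift_window(latents, rng):
    """Evict the earliest latent frame and append a fresh pure-noise frame."""
    new_frame = rng.standard_normal(latents.shape[1:])
    return np.concatenate([latents[1:], new_frame[None]], axis=0)

rng = np.random.default_rng(0)
levels = progressive_noise_levels(4)       # early frames are the low-noise anchors
window = rng.standard_normal((4, 2, 3))    # toy latent window: 4 frames
shifted = shift_window(window, rng)        # frame 0 leaves, a noise frame enters
```

With four frames, the assumed schedule assigns levels 0.25, 0.5, 0.75, and 1.0, so the newest frame always enters the window as pure noise.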

Experiment

- Comparison with state-of-the-art interactive gaming and camera-controlled models validates that the proposed method achieves superior action controllability, visual quality, and 3D consistency over long-horizon sequences, whereas baselines suffer from visual drift, coarse control, or inability to maintain geometric coherence.
- Qualitative analysis confirms the model faithfully follows complex user inputs and preserves consistent 3D scene structures even when revisiting previously seen locations, while prior methods often fail to maintain geometry beyond short generation windows.
- Ablation studies demonstrate that Lie algebra-based action-to-camera mapping provides more accurate motion control than linear approximations, and that increasing long-term memory latents alongside attention sinks significantly enhances 3D consistency and reduces long-horizon error drift.
- Human evaluation and quantitative metrics collectively verify that the approach outperforms existing baselines across all key aspects, establishing it as a robust solution for interactive 3D world modeling.

WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation
