Yume-1.5: A Text-Controlled Interactive World Generation Model

Xiaofeng Mao Zhen Li Chuanhao Li Xiaojie Xu Kaining Ying Tong He Jiangmiao Pang Yu Qiao Kaipeng Zhang

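Abstract

Recent approaches have shown great promise in using diffusion models to generate interactive, explorable worlds. However, most of these methods face serious challenges, including excessively large parameter counts, reliance on lengthy inference steps, and rapidly growing historical context, which severely limit real-time performance; they also lack text-controlled generation capabilities. To address these problems, we propose Yume-1.5, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. Yume-1.5 achieves this through a carefully designed framework that supports keyboard-based exploration. The framework comprises three core components: (1) a long-video generation framework that integrates unified context compression with linear attention; (2) a real-time streaming acceleration strategy based on bidirectional attention distillation and an improved text embedding scheme; (3) a text-controlled method for generating world events. The codebase is provided in the supplementary material.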

One-sentence Summary

The authors from Shanghai AI Laboratory, Fudan University, and Shanghai Innovation Institute propose Yume1.5, a lightweight framework enabling real-time, text- and image-controlled generation of persistent, explorable virtual worlds via continuous keyboard input, leveraging unified context compression, bidirectional attention distillation, and enhanced text embeddings to overcome limitations of prior diffusion-based methods in speed, memory, and controllability.

Key Contributions

  • Yume1.5 introduces Joint Temporal-Spatial-Channel Modeling (TSCM), a novel compression framework that enables stable, long-context video generation by jointly compressing historical frames along temporal, spatial, and channel dimensions, significantly reducing memory usage and maintaining consistent inference speed regardless of context length.

  • The framework integrates a bidirectional attention distillation strategy with a self-forcing-like training paradigm, leveraging TSCM to replace the KV cache and reduce error accumulation, thereby accelerating inference while preserving visual quality and enabling real-time, continuous world exploration.

  • Yume1.5 achieves text-controlled generation of dynamic world events through a mixed-dataset training approach and architectural design, allowing users to edit and generate new events via text prompts—demonstrated on a re-annotated dataset with event-focused captions—while supporting keyboard-based navigation across image- and text-to-world generation modes.

Introduction

The authors leverage video diffusion models to enable interactive, persistent virtual world generation—critical for applications in immersive entertainment, simulation, and virtual embodiment—where users can explore dynamic environments in real time. Prior work faces key limitations: poor generalizability beyond game-like settings, high latency due to slow diffusion inference, and lack of text-based control for event generation, often relying only on mouse or keyboard inputs with limited expressiveness. To overcome these, the authors introduce Yume1.5, which achieves real-time, autoregressive generation of infinite video worlds from a single image or text prompt. Their main contributions are threefold: (1) Joint Temporal-Spatial-Channel Modeling (TSCM), which compresses historical frames across time, space, and channels to maintain long-context coherence without memory explosion; (2) a self-forcing-inspired acceleration method that reduces sampling steps from 50 to 4 while minimizing error accumulation; and (3) text-controlled event generation via a mixed-dataset training strategy and architectural design, enabling dynamic scene evolution through natural language.

Dataset

  • The dataset is composed of three main components: a Real-world Dataset, a Synthetic Dataset, and a specialized Event Dataset, each contributing to balanced performance in realistic motion control, general video quality, and event-specific generation.

  • The Real-world Dataset is derived from Sekai-Real-HQ, a high-quality subset of the Sekai dataset featuring long walking video clips with detailed camera motion trajectories and semantic labels. The authors convert camera trajectory data into discrete keyboard and mouse control signals using a method from [21], mapping them to action vocabularies for camera movement (e.g., →, ↑, ↘) and human-like camera motion (e.g., W, A, W+A). Additionally, the dataset is re-annotated: original scene descriptions are preserved for Text-to-Video (T2V) training, while InternVL3-78B generates new event-focused captions for Image-to-Video (I2V) training to improve event-driven generation.

  • The Synthetic Dataset is built from Openvid, with 80,000 diverse captions selected via similarity-based deduplication and random sampling. Using Wan 2.1 14B, 80,000 videos at 720p resolution are synthesized. Quality is assessed with VBench, and the top 50,000 videos are retained to prevent overfitting and maintain general video generation ability, primarily used for T2V training.

  • The Event Dataset consists of 10,000 first-person images paired with user-generated descriptions across four categories: urban daily life, sci-fi, fantasy, and weather phenomena. These are used to synthesize 10,000 image-to-video sequences with Wan 2.2 14B-I2V. After manual screening, 4,000 high-fidelity videos are selected to enhance semantic alignment in complex, event-driven scenarios, and are used exclusively for T2V training.

  • The model is trained using a mixture of these datasets: the Real-world and Event datasets are used for both T2V and I2V tasks, while the Synthetic Dataset supports T2V to preserve generalization. Training employs a decomposed event and action description pipeline, with adaptive history token downsampling and chunk-based autoregressive inference for efficient memory management.
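The quality filtering step described for the Synthetic Dataset (keeping the 50,000 best of 80,000 synthesized videos) amounts to a top-k selection by score. The following is a minimal sketch of that idea only; the record layout, the `score` field, and the toy values are hypothetical, and real scores would come from a VBench evaluation rather than being hard-coded.

```python
def keep_top_k(clips, k):
    """Return the k clips with the highest quality score, best first."""
    ranked = sorted(clips, key=lambda clip: clip["score"], reverse=True)
    return ranked[:k]


# Hypothetical toy records standing in for synthesized videos and their scores.
clips = [
    {"caption": "a quiet street at dusk", "score": 0.71},
    {"caption": "a crowded market", "score": 0.64},
    {"caption": "a mountain trail in fog", "score": 0.83},
    {"caption": "a rainy harbor", "score": 0.58},
]

kept = keep_top_k(clips, k=2)
kept_captions = [clip["caption"] for clip in kept]

# Retention ratio implied by the numbers in the text: 50,000 of 80,000.
retention_ratio = 50_000 / 80_000
```

With the numbers stated above, this kind of filter keeps 62.5% of the synthesized videos.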

Method

The authors leverage a diffusion-based framework for generating interactive and explorable worlds, built upon a DiT (Diffusion Transformer) backbone. The overall architecture supports both text-to-video and image-to-video generation, with a unified approach to conditioning. For text-to-video, the model takes a text embedding $c$ and a noise tensor $z \in \mathbb{R}^{C \times f_t \times h \times w}$ as input. For image-to-video, a conditional image or video $z_c$ is zero-padded to match the dimensions of $z$, and a binary mask $M_c$ is used to identify preserved regions. The conditional input is fused with the noise via $M_c \cdot z_c + (1 - M_c) \cdot z$, which is then processed by the DiT backbone. The text encoding strategy is distinct from prior work; the caption is decomposed into an Event Description and an Action Description, which are processed separately by a T5 encoder and then concatenated. This design allows for efficient precomputation of action descriptions, reducing computational overhead during inference. The model is trained using the Rectified Flow loss.
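The mask-based fusion of the conditional input with the noise can be illustrated with a small sketch. It treats the latent as a flat list with one scalar per frame, which is a simplification made here for readability; the function name `fuse_condition` and the toy values are assumptions, not part of the described model.

```python
def fuse_condition(z, z_c, m_c):
    """Elementwise M_c * z_c + (1 - M_c) * z over flat latents of equal length."""
    if not (len(z) == len(z_c) == len(m_c)):
        raise ValueError("z, z_c and m_c must have the same length")
    return [m * c + (1 - m) * n for n, c, m in zip(z, z_c, m_c)]


# Toy example: 4 latent frames, one scalar each (hypothetical values).
z = [0.3, -1.2, 0.8, 0.5]    # noise
z_c = [2.0, 0.0, 0.0, 0.0]   # conditional frame, zero-padded to the length of z
m_c = [1, 0, 0, 0]           # binary mask: 1 marks the preserved region

fused = fuse_condition(z, z_c, m_c)
```

Where the mask is 1 the conditional value is kept, and everywhere else the noise passes through unchanged, so image-to-video reduces to text-to-video when the mask is all zeros.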

The framework incorporates a long-video generation method that addresses the challenges of large context and slow inference. This is achieved through a combination of temporal-spatial and channel compression. For temporal-spatial compression, historical frames $z_c$ are downsampled using a multi-rate Patchify scheme. The compression rate varies based on the frame's temporal distance from the current prediction: frames from $t-1$ to $t-2$ are downsampled by (1, 2, 2), frames from $t-3$ to $t-6$ by (1, 4, 4), and so on, with the initial frame also using (1, 2, 2). This is implemented by interpolating the Patchify weights. The compressed representation $\hat{z}_c$ is then concatenated with the prediction frame $\hat{z}_d$, which is processed with a fixed (1, 2, 2) downsampling rate, and the combined tensor is fed into the DiT block.

For channel compression, the historical frames $z_c$ are passed through a Patchify with a compression rate of (8, 4, 4) and reduced to 96 channels, resulting in $z_{\text{linear}}$. This compressed representation is fed into the DiT block. After the video tokens $z^l$ pass through the cross-attention layer, they are processed by a fully connected layer for channel reduction. The predicted frames $z_p^l$ are then concatenated with $z_{\text{linear}}$, and the combined tokens $z_{\text{fus}}$ are fused via a linear attention layer to produce $z_{\text{fus}}^l$. This output is then passed through another fully connected layer to restore the channel dimension and added element-wise to $z^l$ for feature fusion.
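To see why distance-dependent downsampling keeps the history cheap, the token count of the compressed context can be estimated with a short sketch. Only the two tiers stated in the text are taken from the source; the `far_rate` parameter is a hypothetical placeholder for the tiers the text leaves unspecified, the latent size of 32×32 is an assumed toy value, and the special handling of the initial frame is left out for brevity.

```python
def spatial_rate(distance, far_rate=8):
    """Spatial downsampling factor for a history frame `distance` steps back.

    Distances 1-2 and 3-6 follow the rates stated in the text. `far_rate` is
    a hypothetical placeholder for tiers the text does not spell out.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    if distance <= 2:
        return 2
    if distance <= 6:
        return 4
    return far_rate


def history_tokens(num_history, h, w, far_rate=8):
    """Total tokens for `num_history` frames of latent size h x w."""
    total = 0
    for distance in range(1, num_history + 1):
        s = spatial_rate(distance, far_rate)
        total += (h // s) * (w // s)
    return total


# Six history frames on an assumed 32 x 32 latent grid.
multi_rate = history_tokens(6, 32, 32)
uniform = 6 * (32 // 2) * (32 // 2)  # every frame at the (1, 2, 2) rate
```

In this toy setting the multi-rate schedule halves the number of history tokens relative to a uniform (1, 2, 2) rate, and the saving grows as more distant frames are added.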

To accelerate real-time inference, the authors employ a bidirectional attention distillation strategy. This method converts the multi-step diffusion model into a few-step generator by minimizing the expected KL divergence between the real and fake data distributions. The process involves a Real Model (teacher) and a Fake Model (student). The generator $G_\theta$ samples previous frames from its own distribution and uses them as context to generate new predicted frames, forming a clean video sequence $z_0$. The key innovation is that the Fake Model is optimized to match the trajectory of the Real Model via a distribution matching gradient, using model-predicted data rather than real data as conditioning. This approach mitigates the train-inference discrepancy and error accumulation in long videos. The generator is initialized with weights from a foundation model trained on a mixed dataset using an alternating training strategy for text-to-video and image-to-video tasks. The resulting model is capable of generating realistic, interactive, and continuous worlds from a single image or text prompt, supporting keyboard-based exploration.
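The idea of conditioning on model-predicted rather than real data can be sketched as a rollout loop in which every new chunk is generated from the generator's own earlier outputs. This is an illustration of the conditioning pattern only: `toy_generator` is a hypothetical stand-in for the few-step generator, chunks are plain lists of numbers, and the distribution matching loss itself is not shown.

```python
def rollout(generator, first_chunk, num_chunks):
    """Generate chunks autoregressively, feeding the model its own outputs.

    `generator` is any callable mapping the history (a list of chunks) to the
    next chunk. No ground-truth frames enter the context after `first_chunk`.
    """
    history = [first_chunk]
    for _ in range(num_chunks):
        next_chunk = generator(history)
        history.append(next_chunk)
    return history


def toy_generator(history):
    # Hypothetical stand-in: derive the next chunk from the most recent one.
    return [value + 1 for value in history[-1]]


video = rollout(toy_generator, [0, 0], num_chunks=3)
```

Because training sees the same self-generated context that inference produces, the mismatch between the two regimes is what this pattern is meant to reduce.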

Experiment

  • Conducted foundation model training on Wan2.2-5B with 704×1280 resolution, 16 FPS, batch size 40, Adam optimizer (1e-5 LR), and 10,000 iterations on A100 GPUs; followed by 600 iterations of Self-Forcing with TSCM under identical settings.
  • Evaluated on Yume-Bench with 544×960 resolution, 16 FPS, 96 frames, and 4 inference steps; used six metrics including instruction following, subject consistency, background consistency, motion smoothness, aesthetic quality, and imaging quality.
  • Yume1.5 achieved an instruction-following score of 0.836 on image-to-video generation, significantly outperforming Wan-2.1 and MatrixGame, which showed limited real-world controllability.
  • In long-video generation, the model with Self-Forcing and TSCM maintained stable aesthetic (0.523) and image quality (0.601) scores across the final video segments, surpassing the baseline (aesthetic: 0.442, image quality: 0.542).
  • TSCM improved instruction following while stabilizing inference time; at over 8 video blocks, inference time per step remained constant, outperforming full-context input and spatial compression methods.
  • Yume1.5 generated videos at 12 fps at 540p resolution using a single A100 GPU.
  • Limitations include artifacts in motion direction (e.g., backward-moving vehicles) and performance degradation in high-density scenes, attributed to the 5B model size; MoE architectures are proposed as a future solution.

Results show that Yume1.5 outperforms Wan-2.1 and MatrixGame across multiple metrics, achieving the highest instruction-following score of 0.836 and significantly improving aesthetic and imaging quality compared to the baseline models. The model also demonstrates the fastest inference speed, completing generation in just 8 seconds, while maintaining strong consistency in subject, background, and motion smoothness.
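The reported throughput and the 8-second generation time are consistent with each other under one assumption, namely that both refer to the 96-frame evaluation clip. The arithmetic below is a sketch under that assumption; the source does not state explicitly that the two figures share a setting.

```python
frames = 96           # frames per evaluation clip, from the benchmark setup
playback_fps = 16     # frame rate of the generated video
generation_fps = 12   # reported generation throughput on a single A100

clip_seconds = frames / playback_fps              # duration of the clip
generation_seconds = frames / generation_fps      # time needed to generate it
real_time_factor = generation_fps / playback_fps  # 1.0 would be real time
```

Under this reading, a 6-second clip takes 8 seconds to generate, which corresponds to 0.75 times real-time speed.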

Results show that the TSCM model achieves superior performance across multiple metrics compared to the Spatial Compression model. Specifically, TSCM scores 0.836 in Instruction Following, 0.945 in Background Consistency, and 0.985 in Motion Smoothness, outperforming the Spatial Compression model in all evaluated categories.

