Yume: An Interactive World Generation Model

Yume aims to use images, text, or videos to create an interactive, realistic, and dynamic world that allows exploration and control using peripheral devices or neural signals. In this report, we present a preview version of Yume, which creates a dynamic world from an input image and allows exploration of that world using keyboard actions. To achieve high-fidelity and interactive video world generation, we introduce a framework consisting of four main components: camera motion quantization, the video generation architecture, an advanced sampler, and model acceleration. First, we quantize camera motions for stable training and user-friendly interaction via keyboard inputs. Then, we introduce the Masked Video Diffusion Transformer (MVDT) with a memory module for infinite video generation in an autoregressive manner. After that, the training-free Anti-Artifact Mechanism (AAM) and Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) are introduced into the sampler for better visual quality and more precise control. Moreover, we investigate model acceleration through the synergistic optimization of adversarial distillation and caching mechanisms. We use the high-quality world exploration dataset Sekai to train Yume, and it achieves remarkable results in diverse scenes and applications. All data, the codebase, and model weights are available at https://github.com/stdstu12/YUME. Yume will be updated monthly to achieve its original goal. Project page: https://stdstu12.github.io/YUME-Project/.
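
The abstract does not spell out the camera motion quantization scheme, so the following is a minimal sketch, assuming that continuous per-step camera pose deltas are binned into a small discrete action vocabulary that maps onto keyboard keys. The function name, thresholds, and action labels are hypothetical, not the released interface.

    # Hypothetical discrete action vocabulary mapped to keyboard-style controls.
    # The paper's actual quantization scheme is not specified in the abstract.
    ACTIONS = ["forward", "backward", "left", "right", "turn_left", "turn_right", "stay"]

    def quantize_camera_motion(delta_xyz, delta_yaw, trans_thresh=0.05, rot_thresh=0.02):
        """Map a continuous camera pose delta to one discrete keyboard-style action."""
        dx, _, dz = delta_xyz
        # Prefer rotation if it dominates this step (illustrative heuristic only).
        if abs(delta_yaw) > rot_thresh and abs(delta_yaw) >= max(abs(dx), abs(dz)):
            return "turn_left" if delta_yaw > 0 else "turn_right"
        if abs(dz) >= abs(dx) and abs(dz) > trans_thresh:
            return "forward" if dz > 0 else "backward"
        if abs(dx) > trans_thresh:
            return "right" if dx > 0 else "left"
        return "stay"

    # Example: a trajectory of per-step pose deltas becomes a sequence of action tokens.
    trajectory = [((0.0, 0.0, 0.12), 0.0), ((0.03, 0.0, 0.01), 0.08), ((0.0, 0.0, 0.0), 0.0)]
    tokens = [quantize_camera_motion(t, yaw) for t, yaw in trajectory]
    print(tokens)  # ['forward', 'turn_left', 'stay']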
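
The autoregressive generation with a memory module also admits a simple illustration. The sketch below assumes chunk-wise latent denoising conditioned on a rolling buffer of previously generated latents; DummyMVDT, denoise_chunk, and the buffer size are placeholders rather than the released MVDT architecture.

    import torch

    class DummyMVDT(torch.nn.Module):
        """Stand-in for the masked video diffusion transformer; not the released model."""
        def __init__(self, latent_dim=16):
            super().__init__()
            self.proj = torch.nn.Linear(latent_dim, latent_dim)

        def denoise_chunk(self, noisy_latents, memory, action_token):
            # A real model would run masked diffusion conditioned on the memory and action.
            context = memory.mean(dim=0, keepdim=True) if memory.numel() else 0.0
            return self.proj(noisy_latents) + context

    def generate(model, first_frame_latent, actions, frames_per_chunk=8, memory_len=32):
        """Autoregressive rollout: each chunk is denoised conditioned on a rolling memory."""
        memory = first_frame_latent.unsqueeze(0)          # (1, latent_dim)
        video_latents = []
        for action in actions:                            # one keyboard action per chunk
            noise = torch.randn(frames_per_chunk, first_frame_latent.shape[-1])
            chunk = model.denoise_chunk(noise, memory, action)
            video_latents.append(chunk)
            memory = torch.cat([memory, chunk], dim=0)[-memory_len:]  # bounded memory
        return torch.cat(video_latents, dim=0)

    model = DummyMVDT()
    latents = generate(model, torch.randn(16), actions=["forward", "turn_left"])
    print(latents.shape)  # torch.Size([16, 16])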
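
The abstract does not define TTS-SDE, but "time travel" in diffusion sampling commonly refers to periodically re-noising partially denoised latents back to an earlier timestep and denoising them again. The sketch below illustrates only that generic pattern under this assumption; it is not the paper's TTS-SDE formulation, and the step functions are toy stand-ins.

    import torch

    def time_travel_sample(denoise_step, renoise_step, x_T, num_steps=50,
                           travel_every=10, travel_depth=3):
        """Illustrative 'time travel' loop: periodically jump back a few timesteps by
        re-injecting noise, then re-denoise. A generic re-noising schedule, not
        necessarily the paper's TTS-SDE."""
        x, t = x_T, num_steps
        while t > 0:
            x = denoise_step(x, t)                 # one reverse-SDE / denoising update
            t -= 1
            if t > 0 and t % travel_every == 0:    # occasionally travel back in time
                for _ in range(travel_depth):
                    x = renoise_step(x, t)         # forward-noise back to an earlier step
                t += travel_depth
                for _ in range(travel_depth):      # re-denoise the revisited steps
                    x = denoise_step(x, t)
                    t -= 1
        return x

    # Toy stand-ins for the denoising and re-noising updates (not a real diffusion model).
    denoise = lambda x, t: x * 0.95
    renoise = lambda x, t: x + 0.05 * torch.randn_like(x)
    sample = time_travel_sample(denoise, renoise, torch.randn(4, 16))
    print(sample.shape)  # torch.Size([4, 16])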