Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective

Autoregressive large language models (LLMs) have unified a vast range of language tasks, inspiring preliminary efforts in autoregressive video generation. Existing autoregressive video generators either diverge from standard LLM architectures, depend on bulky external text encoders, or incur prohibitive latency due to next-token decoding. In this paper, we introduce Lumos-1, an autoregressive video generator that retains the LLM architecture with minimal architectural modifications. To inject spatiotemporal correlations into LLMs, we identify the efficacy of incorporating 3D RoPE and diagnose its imbalanced frequency spectrum ranges. Therefore, we propose MM-RoPE, a RoPE scheme that preserves the original textual RoPE while providing comprehensive frequency spectra and scaled 3D positions for modeling multimodal spatiotemporal data. Moreover, Lumos-1 adopts a token dependency strategy that obeys intra-frame bidirectionality and inter-frame temporal causality. Based on this dependency strategy, we identify the issue of frame-wise loss imbalance caused by spatial information redundancy and solve it by proposing Autoregressive Discrete Diffusion Forcing (AR-DF). AR-DF introduces temporal tube masking during training with a compatible inference-time masking policy to avoid quality degradation. By using memory-efficient training techniques, we pre-train Lumos-1 on only 48 GPUs, achieving performance comparable to EMU3 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V. Code and models are available at https://github.com/alibaba-damo-academy/Lumos.
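
To make the temporal tube masking mentioned above more concrete, the sketch below shows one way to sample such a mask: a single spatial pattern is drawn once and repeated across all frames, so the same spatial positions are hidden in every frame. The function name, tensor shapes, and mask ratio are illustrative assumptions for exposition, not the authors' exact AR-DF implementation.

```python
import torch

def temporal_tube_mask(num_frames: int, height: int, width: int,
                       mask_ratio: float = 0.5,
                       generator: torch.Generator | None = None) -> torch.Tensor:
    """Sample a boolean mask of shape (num_frames, height, width).

    A single spatial mask is drawn for one frame and repeated along the
    temporal axis, forming "tubes" of masked tokens. This is a hedged
    sketch of the tube-masking idea, not the paper's exact procedure.
    """
    num_tokens = height * width
    num_masked = int(round(mask_ratio * num_tokens))
    # Randomly pick which spatial positions to mask in a single frame.
    scores = torch.rand(num_tokens, generator=generator)
    masked_idx = scores.argsort()[:num_masked]
    spatial_mask = torch.zeros(num_tokens, dtype=torch.bool)
    spatial_mask[masked_idx] = True
    spatial_mask = spatial_mask.view(height, width)
    # Repeat the same spatial pattern across all frames (temporal tube).
    return spatial_mask.unsqueeze(0).expand(num_frames, height, width).clone()

# Example: mask half of the spatial positions in a 4-frame 8x8 token grid.
mask = temporal_tube_mask(num_frames=4, height=8, width=8, mask_ratio=0.5)
assert mask.shape == (4, 8, 8)
assert torch.equal(mask[0], mask[1])  # every frame shares the same pattern
```

Because the mask is shared across frames, the model cannot copy a hidden token from the same spatial location in a neighboring frame, which is the intuition behind using tube masking to counter spatial information redundancy.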