D-AR: Diffusion via Autoregressive Models

This paper presents Diffusion via Autoregressive models (D-AR), a new paradigm that recasts the image diffusion process as a vanilla autoregressive procedure in the standard next-token-prediction fashion. We start by designing a tokenizer that converts images into sequences of discrete tokens, where tokens in different positions can be decoded into different diffusion denoising steps in pixel space. Thanks to the properties of diffusion, these tokens naturally follow a coarse-to-fine order, which lends itself directly to autoregressive modeling. We therefore apply standard next-token prediction to these tokens without modifying any underlying designs (either causal masks or training/inference strategies), so that sequential autoregressive token generation directly mirrors the diffusion procedure in image space. That is, once the autoregressive model generates an increment of tokens, we can directly decode these tokens into the corresponding diffusion denoising step in a streaming manner. Our pipeline naturally reveals several intriguing properties: for example, it supports consistent previews when only a subset of tokens has been generated, and it enables zero-shot layout-controlled synthesis. On the standard ImageNet benchmark, our method achieves an FID of 2.09 using a 775M Llama backbone with 256 discrete tokens. We hope our work can inspire future research on unified autoregressive architectures for visual synthesis, especially with large language models. Code and models will be available at https://github.com/showlab/D-AR.
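
The streaming procedure described above can be illustrated with a minimal toy sketch. It is not the released implementation: the number of denoising steps, the vocabulary size, the equal-sized token groups, and the functions `toy_next_token`, `toy_denoise_step`, and `stream_generate` are all assumptions introduced here for illustration. Only the total of 256 tokens comes from the abstract. The sketch shows the control flow only: the autoregressive loop emits an increment of tokens, and each increment immediately drives one denoising step, yielding a preview.

```python
import random
from typing import Iterator, List, Tuple

NUM_TOKENS = 256                          # sequence length stated in the abstract
NUM_STEPS = 8                             # assumption: number of denoising steps
TOKENS_PER_STEP = NUM_TOKENS // NUM_STEPS # assumption: equal-sized token groups
VOCAB_SIZE = 1024                         # assumption: codebook size


def toy_next_token(prefix: List[int], rng: random.Random) -> int:
    """Hypothetical stand-in for the autoregressive model.

    A real model would condition on `prefix`; this toy ignores it and
    samples uniformly so the example stays self-contained.
    """
    return rng.randrange(VOCAB_SIZE)


def toy_denoise_step(x: List[float], group: List[int], step: int) -> List[float]:
    """Hypothetical stand-in for one denoising step conditioned on a token group.

    The group is reduced to a scalar "target" in [0, 1), and the image is
    moved toward it. The step size grows so the final step lands on the target.
    """
    target = sum(group) / (len(group) * VOCAB_SIZE)
    alpha = 1.0 / (NUM_STEPS - step)
    return [(1.0 - alpha) * v + alpha * target for v in x]


def stream_generate(seed: int = 0, num_pixels: int = 16) -> Iterator[Tuple[int, List[float]]]:
    """Yield (tokens generated so far, current image preview) after each step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(num_pixels)]  # start from noise
    tokens: List[int] = []
    for step in range(NUM_STEPS):
        # Autoregressively generate one increment of tokens.
        for _ in range(TOKENS_PER_STEP):
            tokens.append(toy_next_token(tokens, rng))
        # Decode the newest increment into the corresponding denoising step.
        group = tokens[step * TOKENS_PER_STEP:(step + 1) * TOKENS_PER_STEP]
        x = toy_denoise_step(x, group, step)
        yield len(tokens), x


if __name__ == "__main__":
    for n_tokens, preview in stream_generate():
        print(n_tokens, [round(v, 3) for v in preview[:4]])
```

Under these assumptions, iterating over `stream_generate()` produces a preview after 32, 64, and so on up to 256 tokens, which mirrors the idea that a partial token sequence already corresponds to a partially denoised image.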