Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective

Latent-based image generative models, such as Latent Diffusion Models (LDMs) and Mask Image Models (MIMs), have achieved notable success in image generation tasks. These models typically leverage reconstructive autoencoders such as VQGAN or VAE to encode pixels into a more compact latent space and learn the data distribution in that latent space rather than directly from pixels. However, this practice raises a pertinent question: is it truly the optimal choice? In response, we begin with an intriguing observation: despite sharing the same latent space, autoregressive models significantly lag behind LDMs and MIMs in image generation. This finding contrasts sharply with the field of NLP, where the autoregressive GPT model has established a commanding presence. To address this discrepancy, we introduce a unified perspective on the relationship between latent space and generative models, emphasizing the stability of the latent space in image generative modeling. Furthermore, we propose a simple but effective discrete image tokenizer that stabilizes the latent space for image generative modeling. Experimental results show that image autoregressive modeling with our tokenizer (DiGIT) benefits both image understanding and image generation under the next-token-prediction principle, which is inherently straightforward for GPT models but challenging for other generative models. Remarkably, for the first time, a GPT-style autoregressive model for images outperforms LDMs, and it also exhibits substantial improvements akin to GPT when scaling up the model size. Our findings underscore the potential of an optimized latent space and the integration of discrete tokenization in advancing the capabilities of image generative models. The code is available at https://github.com/DAMO-NLP-SG/DiGIT.
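The next-token-prediction principle mentioned above can be illustrated with a minimal, self-contained sketch: a tokenizer maps an image to a sequence of discrete token ids, and an autoregressive model maximizes log p(x) = Σ_t log p(x_t | x_<t) over those ids. The code below is a hypothetical toy (not the DiGIT implementation): a smoothed bigram count model stands in for the GPT-style transformer, and the token sequences are made up.

```python
# Toy sketch of autoregressive next-token modeling over discrete image tokens.
# A bigram count model with add-one smoothing plays the role of the
# GPT-style transformer; sequences of integer ids play the role of
# tokenizer outputs. All names and data here are illustrative.
from collections import Counter, defaultdict
import math


def train_bigram(sequences, vocab_size):
    """Count bigram transitions and return a smoothed p(next | prev)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1

    def prob(prev, nxt):
        # Add-one (Laplace) smoothing so unseen transitions get mass.
        total = sum(counts[prev].values()) + vocab_size
        return (counts[prev][nxt] + 1) / total

    return prob


def neg_log_likelihood(seq, prob):
    """Autoregressive NLL: -sum_t log p(x_t | x_{t-1})."""
    return -sum(math.log(prob(p, n)) for p, n in zip(seq, seq[1:]))


# Toy "image token" sequences, as if produced by a discrete tokenizer.
train_tokens = [[3, 1, 4, 1, 5], [3, 1, 4, 1, 5]]
prob = train_bigram(train_tokens, vocab_size=8)

nll_seen = neg_log_likelihood([3, 1, 4, 1, 5], prob)
nll_unseen = neg_log_likelihood([5, 4, 3, 1, 1], prob)
```

A sequence matching the training statistics receives a lower NLL than an arbitrary one, which is the same objective a GPT-style model optimizes at scale over tokenizer outputs.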