HyperAI

Token-Shuffle Revolutionizes High-Resolution Image Generation with Autoregressive Models

11 days ago

Autoregressive (AR) models have been the go-to choice for language generation but have traditionally lagged behind diffusion-based models in image synthesis. The gap stems largely from the sheer number of image tokens AR models must process, which hurts training and inference efficiency and caps the resolution of generated images. To tackle this, a team of researchers led by Xu Ma has introduced Token-Shuffle, a simple method for reducing the number of image tokens fed through the Transformer.

The core observation behind Token-Shuffle is that visual vocabularies in Multimodal Large Language Models (MLLMs) often exhibit dimensional redundancy: low-dimensional visual codes from the visual encoder are projected into a high-dimensional language vocabulary space, leaving much of that capacity unused.

To exploit this, Token-Shuffle pairs two operations. Token-shuffle merges spatially local tokens along the channel dimension, shrinking the total number of input tokens. Token-unshuffle then restores the spatial arrangement of the inferred tokens after they pass through the Transformer blocks, so the output retains the correct visual structure.

Because these operations are integrated directly into the MLLM and trained jointly with textual prompts, Token-Shuffle needs no additional pretrained text encoder. The approach supports extremely high-resolution image synthesis within a unified next-token-prediction framework while keeping training and inference efficient. Testing a 2.7-billion-parameter model, the researchers generated images at a resolution of 2048x2048, previously unattainable with AR models.
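The shuffle/unshuffle pair can be sketched as a pure reshaping operation. The following is a minimal NumPy illustration, assuming a window size s whose s×s spatially neighboring tokens are folded into the channel dimension; function names, shapes, and the window parameter are illustrative, not the authors' implementation:

```python
import numpy as np

def token_shuffle(tokens, h, w, s=2):
    """Merge each s x s block of neighboring tokens along the channel dim.

    tokens: (h*w, c) sequence of visual tokens in row-major spatial order.
    Returns a shorter sequence of shape (h*w / s^2, c * s^2).
    """
    n, c = tokens.shape
    assert n == h * w and h % s == 0 and w % s == 0
    x = tokens.reshape(h // s, s, w // s, s, c)
    x = x.transpose(0, 2, 1, 3, 4)          # group the s x s window together
    return x.reshape((h // s) * (w // s), s * s * c)

def token_unshuffle(tokens, h, w, s=2):
    """Inverse of token_shuffle: restore the original spatial token layout."""
    n, merged_c = tokens.shape
    c = merged_c // (s * s)
    x = tokens.reshape(h // s, w // s, s, s, c)
    x = x.transpose(0, 2, 1, 3, 4)          # undo the window grouping
    return x.reshape(h * w, c)
```

With s=2 the Transformer sees 4x fewer tokens. For scale: if a 2048x2048 image were tokenized with a 16x-downsampling visual encoder (an assumed, typical factor), the 128x128 = 16,384 token grid would shrink to 4,096 shuffled inputs.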
Evaluated on the GenAI-bench benchmark, the model scored 0.77 on hard prompts, surpassing the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15. Large-scale human evaluations backed up these numbers: participants judged the generated images better aligned with the text, freer of visual flaws, and superior in overall appearance compared with competing models.

These results suggest that Token-Shuffle can substantially extend MLLMs' capabilities in high-resolution image generation, making it a promising foundational design for future work in this area. By addressing the longstanding problems of token redundancy and inefficiency, the method opens the door to higher-quality, higher-resolution synthesis with AR models and challenges the dominance of diffusion-based approaches in this domain.

In short, Token-Shuffle is a simple yet effective technique that optimizes how image tokens are used in AR models, enabling them to generate high-resolution images efficiently. Its impressive performance highlights its promise for advancing multimodal generative AI.
