Randomized Autoregressive Visual Generation

This paper presents Randomized AutoRegressive modeling (RAR) for visual generation, which sets a new state of the art on the image generation task while maintaining full compatibility with language modeling frameworks. The proposed RAR is simple: during a standard autoregressive training process with a next-token prediction objective, the input sequence, typically ordered in raster form, is randomly permuted into different factorization orders with a probability r, where r starts at 1 and linearly decays to 0 over the course of training. This annealing strategy enables the model to maximize the expected likelihood over all factorization orders, thereby effectively improving its capability to model bidirectional contexts. Importantly, RAR preserves the integrity of the autoregressive modeling framework, ensuring full compatibility with language modeling while significantly improving performance in image generation. On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-of-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods. Code and models will be made available at https://github.com/bytedance/1d-tokenizer.
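The annealed permutation schedule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function and variable names (`permutation_prob`, `training_order`) are hypothetical, and the token sequence is represented as a plain list of token indices.

```python
import random


def permutation_prob(step, total_steps):
    """Probability r of permuting the factorization order:
    starts at 1 and linearly decays to 0 over training."""
    return max(0.0, 1.0 - step / total_steps)


def training_order(tokens, step, total_steps, rng=random):
    """With probability r, train on a randomly permuted factorization
    order; otherwise keep the standard raster order."""
    r = permutation_prob(step, total_steps)
    order = list(range(len(tokens)))
    if rng.random() < r:
        rng.shuffle(order)  # random factorization order for this sample
    return [tokens[i] for i in order]
```

At step 0 every sequence is permuted (r = 1), so the model is trained across many factorization orders; by the end of training (r = 0) it always sees the standard raster order, matching conventional autoregressive training.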