Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation

Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a "token" is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a k×k grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as continuous entity regression, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground-truth tokens, leading to Noisy Context Learning, which effectively alleviates exposure bias. As a result, xAR offers two key advantages: (1) it enables flexible prediction units that capture different contextual granularity and spatial structures, and (2) it mitigates exposure bias by avoiding reliance on teacher forcing. On the ImageNet-256 generation benchmark, our base model, xAR-B (172M), outperforms DiT-XL/SiT-XL (675M) while achieving 20× faster inference. Meanwhile, xAR-H sets a new state-of-the-art with an FID of 1.24, running 2.2× faster than the previous best-performing model without relying on vision foundation modules (e.g., DINOv2) or advanced guidance interval sampling.
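To make the "continuous entity regression with flow matching" idea concrete, the following is a minimal sketch of one such training step; it is not the paper's actual code, and the function names, the zero-mean Gaussian context noise, and the linear (rectified-flow-style) interpolation path are illustrative assumptions. The key points it shows are (1) regressing a velocity toward the clean entity rather than classifying a discrete token, and (2) perturbing the conditioning context during training, as in Noisy Context Learning.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(predict_velocity, context, x1,
                       noise_context=True, sigma=0.5):
    """One AR step as continuous entity regression (illustrative sketch).

    predict_velocity: any callable (xt, t, context) -> velocity estimate;
                      a hypothetical stand-in for the AR backbone.
    context:          conditioning entities from previous AR steps.
    x1:               the clean target entity (patch, cell, subsample, ...).
    """
    if noise_context:
        # Noisy Context Learning: condition on perturbed, not ground-truth,
        # context entities so training matches imperfect inference-time inputs.
        context = context + sigma * rng.standard_normal(context.shape)

    x0 = rng.standard_normal(x1.shape)        # Gaussian source sample
    t = rng.random((x1.shape[0], 1))          # time drawn uniformly in [0, 1]
    xt = (1 - t) * x0 + t * x1                # linear interpolation path
    v_target = x1 - x0                        # constant velocity target

    v_pred = predict_velocity(xt, t, context)
    return np.mean((v_pred - v_target) ** 2)  # regression, not classification
```

At inference, the same velocity model would be integrated from noise to a clean entity at each AR step, so no teacher-forced ground-truth tokens are ever required.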