8 days ago

ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer

Jinyi Hu, Shengding Hu, Yuxuan Song, Yufei Huang, Mingxuan Wang, Hao Zhou, Zhiyuan Liu, Wei-Ying Ma, Maosong Sun
Abstract

The recent surge of interest in comprehensive multimodal models has necessitated the unification of diverse modalities. However, the unification suffers from disparate methodologies. Continuous visual generation necessitates the full-sequence diffusion-based approach, despite its divergence from the autoregressive modeling in the text domain. We posit that autoregressive modeling, i.e., predicting the future based on past deterministic experience, remains crucial in developing both a visual generation model and a potential unified multimodal model. In this paper, we explore an interpolation between the autoregressive modeling and full-parameters diffusion to model visual information. At its core, we present ACDiT, an Autoregressive blockwise Conditional Diffusion Transformer, where the block size of diffusion, i.e., the size of autoregressive units, can be flexibly adjusted to interpolate between token-wise autoregression and full-sequence diffusion. ACDiT is easy to implement, as simple as creating a Skip-Causal Attention Mask (SCAM) during training. During inference, the process iterates between diffusion denoising and autoregressive decoding that can make full use of KV-Cache. We verify the effectiveness of ACDiT on image and video generation tasks. We also demonstrate that, benefiting from autoregressive modeling, ACDiT can be seamlessly used in visual understanding tasks despite being trained on the diffusion objective. The analysis of the trade-off between autoregressive modeling and diffusion demonstrates the potential of ACDiT to be used in long-horizon visual generation tasks. These strengths make it promising as the backbone of future unified models.
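
The abstract does not spell out how the Skip-Causal Attention Mask is laid out, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes the training sequence concatenates all clean blocks followed by all noisy blocks, that each noisy block attends to itself and to strictly earlier clean blocks, and that clean blocks attend block-causally to clean blocks. The function name `skip_causal_mask` and its parameters are hypothetical.

```python
import numpy as np

def skip_causal_mask(num_blocks: int, block_size: int) -> np.ndarray:
    """Return a boolean mask; True means the query (row) may attend to the key (column).

    Assumed layout (an assumption, not stated in the abstract):
    positions [0, N*B) hold clean tokens, positions [N*B, 2*N*B) hold noisy tokens,
    where N = num_blocks and B = block_size.
    """
    n = num_blocks * block_size
    block_id = np.arange(n) // block_size  # block index of each token

    q = block_id[:, None]  # query block index, one per row
    k = block_id[None, :]  # key block index, one per column

    clean_to_clean = q >= k                          # own block and earlier clean blocks
    clean_to_noisy = np.zeros((n, n), dtype=bool)    # clean tokens never see noisy tokens
    noisy_to_clean = q > k                           # only strictly earlier clean blocks
    noisy_to_noisy = q == k                          # only the noisy block being denoised

    return np.block([
        [clean_to_clean, clean_to_noisy],
        [noisy_to_clean, noisy_to_noisy],
    ])

if __name__ == "__main__":
    mask = skip_causal_mask(num_blocks=3, block_size=2)
    print(mask.astype(int))
```

Under these assumptions, `block_size=1` reduces the noisy-to-clean pattern to token-wise autoregressive conditioning, while `num_blocks=1` leaves a single noisy block with full attention over itself, which corresponds to full-sequence diffusion. This is the interpolation the abstract describes.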
