6 months ago

Jinyi Hu Shengding Hu Yuxuan Song Yufei Huang Mingxuan Wang Hao Zhou Zhiyuan Liu Wei-Ying Ma Maosong Sun

Abstract

The recent surge of interest in comprehensive multimodal models has necessitated the unification of diverse modalities. However, the unification suffers from disparate methodologies. Continuous visual generation necessitates the full-sequence diffusion-based approach, despite its divergence from the autoregressive modeling used in the text domain. We posit that autoregressive modeling, i.e., predicting the future based on past deterministic experience, remains crucial in developing both a visual generation model and a potential unified multimodal model. In this paper, we explore an interpolation between autoregressive modeling and full-parameters diffusion to model visual information. At its core, we present ACDiT, an Autoregressive blockwise Conditional Diffusion Transformer, where the block size of diffusion, i.e., the size of autoregressive units, can be flexibly adjusted to interpolate between token-wise autoregression and full-sequence diffusion. ACDiT is easy to implement, as simple as creating a Skip-Causal Attention Mask (SCAM) during training. During inference, the process iterates between diffusion denoising and autoregressive decoding, which can make full use of the KV-Cache. We verify the effectiveness of ACDiT on image and video generation tasks. We also demonstrate that, benefiting from autoregressive modeling, ACDiT can be seamlessly used in visual understanding tasks despite being trained on the diffusion objective. The analysis of the trade-off between autoregressive modeling and diffusion demonstrates the potential of ACDiT to be used in long-horizon visual generation tasks. These strengths make it promising as the backbone of future unified models.
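The abstract does not spell out the exact form of the Skip-Causal Attention Mask, so the following is only a sketch of one plausible reading, not the authors' implementation. It assumes a training sequence laid out as all clean blocks followed by all noised blocks, where each noised block attends to itself and to strictly earlier clean blocks, and each clean block attends causally to clean blocks. The function name `skip_causal_mask`, the layout, and the attention rules are all assumptions made for illustration.

```python
def skip_causal_mask(num_blocks, block_size):
    """Build a boolean attention mask; mask[q][k] is True if query q may attend to key k.

    Assumed (hypothetical) layout: positions [0, n) are clean tokens and
    positions [n, 2n) are their noised counterparts, with n = num_blocks * block_size.
    """
    n = num_blocks * block_size
    total = 2 * n
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        q_noised = q >= n
        q_block = (q % n) // block_size
        for k in range(total):
            k_noised = k >= n
            k_block = (k % n) // block_size
            if not q_noised and not k_noised:
                # Clean query, clean key: block-wise causal attention.
                allowed = k_block <= q_block
            elif q_noised and not k_noised:
                # Noised query, clean key: only strictly earlier clean blocks.
                allowed = k_block < q_block
            elif q_noised and k_noised:
                # Noised query, noised key: only within the same block.
                allowed = k_block == q_block
            else:
                # Clean queries never attend to noised keys.
                allowed = False
            mask[q][k] = allowed
    return mask


mask = skip_causal_mask(num_blocks=3, block_size=2)
```

Under these assumptions, setting `block_size=1` gives token-wise conditioning on the clean prefix, while `num_blocks=1` leaves a single noised block that attends only to itself, which corresponds to the two ends of the interpolation the abstract describes.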

ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer
