
Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, Chaoyou Fu

Abstract

While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies have successfully applied discrete diffusion models to various domains, such as visual understanding and image generation, revealing their considerable potential as a promising backbone for multimodal systems. Drawing inspiration from this pioneering research, we introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion, which unifies understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models in powering the next generation of multimodal foundation models. Project webpage: https://omni-diffusion.github.io.

One-sentence Summary

Researchers from Nanjing University, Tencent Youtu Lab, and CASIA introduce Omni-Diffusion, the first any-to-any multimodal model built on mask-based discrete diffusion. Unlike autoregressive backbones, this unified architecture captures joint distributions across text, speech, and images, achieving performance on par with or better than existing systems on complex multimodal understanding and generation tasks.

Key Contributions

  • Current multimodal systems rely heavily on autoregressive architectures, prompting the need for efficient alternatives that can unify understanding and generation across text, speech, and images.
  • Omni-Diffusion introduces the first any-to-any multimodal model built entirely on a mask-based discrete diffusion framework to directly capture the joint distribution of multimodal tokens in a shared semantic space.
  • Extensive experiments on diverse benchmarks demonstrate that this approach achieves performance comparable to or better than existing autoregressive systems while supporting complex multimodal scenarios.

Introduction

Multimodal intelligence currently relies heavily on autoregressive large language models, which limits architectural diversity and often requires separate components to handle generation across different data types like text, images, and speech. While discrete diffusion models have shown promise in individual domains, prior work has struggled to unify them into a single backbone that natively supports any-to-any multimodal tasks without relying on auxiliary decoders or text-only foundations. The authors introduce Omni-Diffusion, the first any-to-any multimodal model built entirely on a mask-based discrete diffusion framework to learn the joint distribution of multimodal tokens. They leverage a three-stage progressive training pipeline and specialized inference techniques, such as attenuated tail-pad masking and position penalties, to achieve performance comparable to or better than existing autoregressive systems while enabling unified comprehension and generation across text, speech, and images.

Method

The Omni-Diffusion model is designed as a unified probabilistic framework that operates over a joint distribution of multimodal discrete tokens. Rather than relying on additional output models to project textual features from large language models into generated multimodal data, the authors directly model an intrinsically unified multimodal discrete representation space. This approach enables effective comprehension and generation of data across text, speech, and image modalities within a single architecture.

Model Architecture and Formulation

The core of the system is a mask-based discrete diffusion model built upon the pre-trained Dream-7B language model. To accommodate multimodal inputs, the vocabulary is expanded with 16,384 speech tokens and 8,192 image tokens. The architecture employs distinct tokenizers for each modality. For images, the authors leverage MAGVIT-v2, which compresses images into discrete tokens with a downsampling factor of 16 and a codebook size of 8,192. For speech, SenseVoiceSmall handles encoding, while the GLM-4-Voice decoder performs speech tokenization and generation at a rate of 12.5 Hz with a codebook size of 16,384.
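One common way to realize such a shared token space is to offset each modality's codebook into disjoint id ranges of a single vocabulary. The sketch below uses the speech and image codebook sizes stated above; the text vocabulary size and the offset scheme itself are assumptions, not details from the paper:

```python
# Hypothetical layout of a unified multimodal vocabulary.
# Speech (16,384) and image (8,192) codebook sizes come from the paper;
# the text vocabulary size and the offsetting scheme are assumptions.
TEXT_VOCAB = 151_936   # assumed text vocabulary size (Qwen-style tokenizer)
SPEECH_VOCAB = 16_384  # GLM-4-Voice codebook size (from the paper)
IMAGE_VOCAB = 8_192    # MAGVIT-v2 codebook size (from the paper)

SPEECH_OFFSET = TEXT_VOCAB
IMAGE_OFFSET = TEXT_VOCAB + SPEECH_VOCAB
TOTAL_VOCAB = TEXT_VOCAB + SPEECH_VOCAB + IMAGE_VOCAB  # plus special tokens in practice

def to_unified(token_id: int, modality: str) -> int:
    """Map a modality-local token id into the shared vocabulary."""
    if modality == "text":
        return token_id
    if modality == "speech":
        return SPEECH_OFFSET + token_id
    if modality == "image":
        return IMAGE_OFFSET + token_id
    raise ValueError(f"unknown modality: {modality}")
```

With disjoint ranges, a single softmax over `TOTAL_VOCAB` logits can score tokens of any modality at any position, which is what lets one backbone treat the mixed sequence uniformly.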

As illustrated in the paper's architecture figure, text, image, and speech tokens are wrapped with special beginning and end tokens to form a unified sequence $x_0 \in \mathbb{R}^{L}$. During training, the model corrupts this sequence by randomly replacing tokens with a special mask token at a ratio derived from the time step $t$. The model then predicts the clean token sequence $\hat{x}_0 = p_\theta(x_0 \mid x_t)$. The training objective is the cross-entropy loss computed only on the masked positions:

$$
L = -\,\mathbb{E}_{t,\; q(x_t \mid x_0)}\left[\sum_{i=1}^{L} \mathbb{I}\left[x_t^{i} = [\mathrm{MASK}]\right] \log p_\theta\!\left(x_0^{i} \mid x_t\right)\right]
$$

The design utilizes full attention on all multimodal tokens, treating them uniformly within the sequence without modality-specific optimization during the core training process.
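The corruption process and masked-position loss map directly to a few lines of PyTorch. The sketch below assumes a linear schedule where the mask ratio equals $t$, a placeholder `MASK_ID`, and a `model` callable that returns per-position logits over the unified vocabulary; these specifics are illustrative rather than confirmed by the paper:

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # placeholder id for the special [MASK] token (assumed)

def diffusion_loss(model, x0, t):
    """One masked discrete diffusion training step.

    x0: (B, L) clean unified token sequence
    t:  (B,) time steps in (0, 1]; under the linear schedule assumed
        here, the mask ratio equals t.
    """
    # Corrupt: replace each token with [MASK] independently with prob. t.
    mask = torch.rand_like(x0, dtype=torch.float) < t[:, None]
    xt = torch.where(mask, torch.full_like(x0, MASK_ID), x0)
    # Predict the clean tokens p_theta(x0 | xt) at every position.
    logits = model(xt)  # (B, L, V)
    # Cross-entropy restricted to masked positions, matching the loss above.
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        x0.view(-1),
        reduction="none",
    )
    mask_flat = mask.view(-1).float()
    return (per_token * mask_flat).sum() / mask_flat.sum().clamp(min=1)
```

Unmasked positions contribute nothing to the gradient, so the model is only ever trained to recover tokens it cannot see.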

Training Strategy

To ensure stable training across distinct data distributions, the authors implement a three-stage progressive training pipeline that gradually extends the model's capabilities from visual-language alignment to full multimodal interaction.

As illustrated in the paper's training-pipeline figure, the first stage focuses on Visual-Language Pre-Alignment, optimizing the model on text-to-image and image-captioning tasks to align the visual modality with the semantic space of the language model. The second stage, Speech–Vision–Language Joint Alignment, retains the visual-text datasets while introducing automatic speech recognition (ASR) and text-to-speech (TTS) data to facilitate speech-text alignment. The final stage optimizes the model on the constructed Speech-Driven Visual Interaction (SDVI) dataset, which includes spoken visual question answering and speech-to-image generation tasks, further strengthening the unified alignment across all modalities. Additionally, an attenuated tail-pad masking strategy is employed to prevent overfitting to pad tokens during variable-length generation.
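The paper names the attenuated tail-pad masking strategy without spelling out its mechanics here. One plausible reading, sketched below purely as an assumption, is to geometrically attenuate the loss weight on pad tokens the further they sit past the end of the actual content:

```python
import torch

def tail_pad_loss_weights(seq_len: int, content_len: int, decay: float = 0.5):
    """Hypothetical sketch of attenuated tail-pad weighting.

    Content positions keep full weight 1.0; pad positions after the
    content receive geometrically decaying weights (decay, decay^2, ...),
    so long runs of pads contribute little to the loss and the model is
    less prone to overfitting to pad tokens. The decay scheme is our
    assumption, not the paper's specification.
    """
    w = torch.ones(seq_len)
    pad_offset = torch.arange(seq_len) - content_len  # >= 0 on the pad tail
    tail = pad_offset >= 0
    w[tail] = decay ** (pad_offset[tail].float() + 1)
    return w
```

Such a weight vector would simply multiply the per-position masked cross-entropy before averaging, leaving the rest of the training objective unchanged.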

Overall Framework

The resulting system functions as an any-to-any multimodal framework capable of handling diverse tasks.

As illustrated in the paper's framework figure, the system supports speech tasks such as ASR and TTS, visual tasks such as captioning and visual question answering, and complex speech-driven visual interaction tasks, including speech-to-image generation and spoken visual understanding. By unifying these modalities, the model achieves effective comprehension and generation across the text, image, and speech domains.

Experiment

  • Main benchmarks evaluate speech recognition, text-to-speech, visual question answering, and text-to-image generation, confirming that the model matches or exceeds specialized and any-to-any baselines in both understanding and generation tasks.
  • Speech-to-image experiments validate strong cross-modal alignment, demonstrating that the model produces consistent visual outputs whether conditioned on text or synthesized speech.
  • Qualitative examples illustrate the model's ability to generate diverse, high-quality images with fine details and to perform image inpainting without additional fine-tuning, leveraging its mask-token-prediction mechanism.
  • Sampling efficiency tests show that the model maintains high generation quality with significantly fewer inference steps compared to autoregressive approaches, highlighting the speed advantages of discrete diffusion.
  • Overall, the results establish the model as a unified foundation for multimodal AI, capable of handling diverse modalities with high fidelity and efficiency.
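The sampling-efficiency advantage noted above comes from unmasking many tokens per forward pass instead of one token per step. Below is a minimal confidence-based iterative sampler in the MaskGIT style; the cosine schedule, `MASK_ID`, and `model` interface are assumptions, and the paper's attenuated tail-pad masking and position penalties are not reproduced:

```python
import torch

MASK_ID = 0  # placeholder [MASK] id (assumed)

@torch.no_grad()
def parallel_decode(model, length: int, steps: int = 8):
    """Confidence-based iterative unmasking (MaskGIT-style sketch).

    Starts fully masked and, at each step, keeps the highest-confidence
    predictions while re-masking the rest, so a length-L sequence is
    produced in `steps` forward passes rather than L autoregressive ones.
    """
    x = torch.full((1, length), MASK_ID, dtype=torch.long)
    for s in range(steps):
        probs = model(x).softmax(-1)            # (1, L, V)
        conf, pred = probs.max(-1)              # per-position confidence
        still_masked = x == MASK_ID
        # Cosine schedule: fraction of tokens left masked after this step.
        frac = 0.0 if s == steps - 1 else float(
            torch.cos(torch.tensor((s + 1) / steps * torch.pi / 2)))
        n_remask = int(frac * length)
        # Already-fixed tokens get infinite confidence so they stay fixed.
        conf = conf.masked_fill(~still_masked, float("inf"))
        if n_remask > 0:
            # Re-mask the n lowest-confidence positions for the next pass.
            idx = conf.topk(n_remask, largest=False).indices
            pred[0, idx[0]] = MASK_ID
        x = torch.where(still_masked, pred, x)
    return x
```

Because every forward pass scores all positions at once, halving `steps` roughly halves the wall-clock cost, which is the trade-off the sampling-efficiency experiments explore.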
