Omni-Diffusion: Unified Multimodal Understanding and Generation via Masked Discrete Diffusion
Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, Chaoyou Fu
Abstract
While recent multimodal large language models (MLLMs) have made remarkable progress, they rely predominantly on the conventional autoregressive architecture, leaving substantial room to explore effective and efficient architectural alternatives. Meanwhile, recent studies have successfully applied discrete diffusion models to a variety of domains, such as visual understanding and image generation, revealing their considerable potential as a promising backbone for multimodal systems. Inspired by these pioneering works, we present Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion, unifying understanding and generation across text, speech, and images. Omni-Diffusion uses a unified mask-based discrete diffusion model to directly capture the joint distribution of discrete multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. Across a diverse set of benchmarks, our method matches or surpasses existing multimodal systems handling two or more modalities, underscoring the significant promise of diffusion models for powering the next generation of multimodal foundation models. Project page: https://omni-diffusion.github.io.
One-sentence Summary
Researchers from Nanjing University, Tencent Youtu Lab, and CASIA introduce Omni-Diffusion, the first any-to-any multimodal model built on mask-based discrete diffusion; unlike autoregressive backbones, this unified architecture captures the joint distribution across text, speech, and images and achieves state-of-the-art performance on complex multimodal understanding and generation tasks.
Key Contributions
- Current multimodal systems rely heavily on autoregressive architectures, prompting the need for efficient alternatives that can unify understanding and generation across text, speech, and images.
- Omni-Diffusion introduces the first any-to-any multimodal model built entirely on a mask-based discrete diffusion framework to directly capture the joint distribution of multimodal tokens in a shared semantic space.
- Extensive experiments on diverse benchmarks demonstrate that this approach achieves performance comparable to or better than existing autoregressive systems while supporting complex multi-modal scenarios.
Introduction
Multimodal intelligence currently relies heavily on autoregressive large language models, which limits architectural diversity and often requires separate components to handle generation across different data types like text, images, and speech. While discrete diffusion models have shown promise in individual domains, prior work has struggled to unify them into a single backbone that natively supports any-to-any multimodal tasks without relying on auxiliary decoders or text-only foundations. The authors introduce Omni-Diffusion, the first any-to-any multimodal model built entirely on a mask-based discrete diffusion framework to learn the joint distribution of multimodal tokens. They leverage a three-stage progressive training pipeline and specialized inference techniques, such as attenuated tail-pad masking and position penalties, to achieve performance comparable to or better than existing autoregressive systems while enabling unified comprehension and generation across text, speech, and images.
Method
The Omni-Diffusion model is designed as a unified probabilistic framework that operates over a joint distribution of multimodal discrete tokens. Rather than relying on additional output models to project textual features from large language models into generated multimodal data, the authors directly model an intrinsically unified multimodal discrete representation space. This approach enables effective comprehension and generation of data across text, speech, and image modalities within a single architecture.
Model Architecture and Formulation

The core of the system is a mask-based discrete diffusion model built upon the pre-trained Dream-7B language model. To accommodate multimodal inputs, the vocabulary is expanded with 16,384 speech tokens and 8,192 image tokens. The architecture employs a distinct tokenizer for each modality. For images, the authors leverage MAGVIT-v2, which compresses images into discrete tokens with a downsampling factor of 16 and a codebook size of 8,192. For speech, SenseVoiceSmall is used for encoding, while the GLM-4-Voice decoder handles speech generation and tokenization at a rate of 12.5 Hz with a codebook size of 16,384.
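To make the token budget concrete, the sketch below computes how many discrete tokens each modality contributes under the tokenizer settings above; the 384x384 image resolution and 10-second utterance length are illustrative assumptions, not values reported by the authors.

```python
# Back-of-the-envelope token budget per training example, based on the
# tokenizer settings reported above. The 384x384 image resolution and the
# 10 s utterance length are illustrative assumptions only.

IMAGE_DOWNSAMPLE = 16       # MAGVIT-v2 spatial downsampling factor
IMAGE_CODEBOOK = 8_192      # MAGVIT-v2 codebook size
SPEECH_RATE_HZ = 12.5       # GLM-4-Voice tokens per second of audio
SPEECH_CODEBOOK = 16_384    # speech codebook size

def image_token_count(height: int, width: int) -> int:
    """Number of discrete image tokens after MAGVIT-v2 tokenization."""
    return (height // IMAGE_DOWNSAMPLE) * (width // IMAGE_DOWNSAMPLE)

def speech_token_count(duration_s: float) -> int:
    """Number of discrete speech tokens for an utterance of given length."""
    return int(duration_s * SPEECH_RATE_HZ)

if __name__ == "__main__":
    print(image_token_count(384, 384))       # 24 * 24 = 576 image tokens
    print(speech_token_count(10.0))          # 125 speech tokens for 10 s of audio
    # Dream-7B's text vocabulary is extended by this many new token ids:
    print(IMAGE_CODEBOOK + SPEECH_CODEBOOK)  # 24,576
```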
[Figure: the unified mask-based discrete diffusion architecture, with text, image, and speech tokens combined into a single sequence.]
In this architecture, text, image, and speech tokens are wrapped with special beginning and end tokens to form a unified sequence $x_0 \in \mathbb{R}^{L}$. During training, the model corrupts this sequence by randomly replacing tokens with a special mask token at a ratio derived from the time step $t$. The model then predicts the clean token sequence $\hat{x}_0 = p_\theta(x_0 \mid x_t)$. The training objective is the cross-entropy loss computed only on the masked positions:

$$\mathcal{L} = -\,\mathbb{E}_{t,\; q(x_t \mid x_0)}\left[\sum_{i=1}^{L} \mathbb{I}\!\left[x_t^{i} = \texttt{[MASK]}\right] \log p_\theta\!\left(x_0^{i} \mid x_t\right)\right]$$

The design applies full attention over all multimodal tokens, treating them uniformly within the sequence without modality-specific optimization during the core training process.
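A minimal PyTorch sketch of this objective is given below, assuming a generic `model` that returns per-position logits over the extended vocabulary; the uniform time-step sampling and linear mask-ratio schedule are common choices for mask-based discrete diffusion and are assumptions here, not the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, x0, mask_token_id):
    """One training step of mask-based discrete diffusion.

    x0: (B, L) clean sequence of interleaved text/image/speech token ids,
        already wrapped with the modality begin/end special tokens.
    model: callable returning logits of shape (B, L, vocab_size).
    """
    B, L = x0.shape
    # Sample a time step per example; the mask ratio grows with t
    # (a linear schedule is assumed here for simplicity).
    t = torch.rand(B, device=x0.device)                 # t ~ U(0, 1)
    mask = torch.rand(B, L, device=x0.device) < t[:, None]
    if not mask.any():                                   # degenerate draw: keep at least one masked position
        mask[0, torch.randint(L, (1,))] = True

    # Corrupt the sequence: replace selected positions with [MASK].
    xt = torch.where(mask, torch.full_like(x0, mask_token_id), x0)

    # Predict the clean tokens with full (bidirectional) attention.
    logits = model(xt)                                   # (B, L, V)

    # Cross-entropy only on the masked positions.
    return F.cross_entropy(logits[mask], x0[mask])
```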
Training Strategy

To ensure stable training across distinct data distributions, the authors implement a three-stage progressive training pipeline. This strategy gradually extends the model's capabilities from visual-language alignment to full multimodal interaction.
[Figure: the three-stage progressive training pipeline.]
The first stage focuses on Visual-Language Pre-Alignment, optimizing the model on text-to-image and image captioning tasks to align the visual modality with the semantic space of the language model. The second stage, Speech–Vision–Language Joint Alignment, retains the visual-text datasets while introducing automatic speech recognition and text-to-speech data to facilitate speech-text alignment. The final stage optimizes the model on the constructed Speech-Driven Visual Interaction (SDVI) dataset, which includes spoken visual question answering and speech-to-image generation tasks. This stage further enhances the unified alignment across all modalities. Additionally, an attenuated tail-pad masking strategy is employed to prevent overfitting to pad tokens during variable-length generation.
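As a compact reference, the pipeline can be summarized as the configuration below; only the stage names and task composition come from the description above, and optimization hyperparameters are deliberately omitted because they are not specified here.

```python
# Sketch of the three-stage progressive training pipeline. Only the task
# composition is taken from the description above; batch sizes, learning
# rates, and data weights are intentionally left out.
TRAINING_STAGES = [
    {
        "name": "Stage 1: Visual-Language Pre-Alignment",
        "tasks": ["text-to-image generation", "image captioning"],
    },
    {
        "name": "Stage 2: Speech-Vision-Language Joint Alignment",
        "tasks": [
            "text-to-image generation", "image captioning",  # retained from stage 1
            "automatic speech recognition (ASR)", "text-to-speech (TTS)",
        ],
    },
    {
        "name": "Stage 3: Speech-Driven Visual Interaction (SDVI)",
        "tasks": ["spoken visual question answering", "speech-to-image generation"],
    },
]

for stage in TRAINING_STAGES:
    print(stage["name"], "->", ", ".join(stage["tasks"]))
```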
Overall Framework

The resulting system functions as an any-to-any multimodal framework capable of handling diverse tasks.
[Figure: the any-to-any task framework covering speech, visual, and speech-driven visual interaction tasks.]
This framework supports Speech Tasks such as ASR and TTS, Visual Tasks like captioning and visual QA, and complex Speech-Driven Visual Interaction tasks including speech-to-image generation and spoken visual understanding. By unifying these modalities, the model achieves effective comprehension and generation across text, image, and speech domains.
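The sketch below illustrates how such a mask-based framework can generate a target-modality segment (for example, the image tokens of a speech-to-image request) by iterative parallel unmasking. This is a generic discrete-diffusion sampler with a confidence-based commit schedule; the paper's attenuated tail-pad masking and position-penalty techniques are not reproduced here.

```python
import torch

@torch.no_grad()
def generate_segment(model, prompt_ids, target_len, mask_token_id, num_steps=16):
    """Fill a fully masked target segment by iterative parallel unmasking.

    The target span starts as all [MASK]; at each step the model predicts
    every position, and the most confident predictions are committed while
    the remaining positions stay masked for the next step.
    """
    device = prompt_ids.device
    seq = torch.cat([prompt_ids,
                     torch.full((target_len,), mask_token_id, device=device)])
    target = torch.arange(len(prompt_ids), len(seq), device=device)

    for step in range(num_steps):
        logits = model(seq.unsqueeze(0))[0]        # (L, V)
        probs = logits.softmax(-1)
        conf, pred = probs.max(-1)                 # per-position confidence and argmax

        still_masked = seq[target] == mask_token_id
        masked_pos = target[still_masked]
        if masked_pos.numel() == 0:
            break

        # Commit roughly an equal share of positions per remaining step.
        k = max(1, masked_pos.numel() // (num_steps - step))
        top = conf[masked_pos].topk(k).indices
        chosen = masked_pos[top]
        seq[chosen] = pred[chosen]

    return seq[target]
```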
Experiment
- Main benchmarks evaluate speech recognition, text-to-speech, visual question answering, and text-to-image generation, confirming that the model matches or exceeds specialized and any-to-any baselines in both understanding and generation tasks.
- Speech-to-image experiments validate strong cross-modal alignment, demonstrating that the model produces consistent visual outputs whether conditioned on text or synthesized speech.
- Qualitative examples illustrate the model's ability to generate diverse, high-quality images with fine details and to perform image inpainting without additional fine-tuning, leveraging its mask-token-prediction mechanism.
- Sampling efficiency tests show that the model maintains high generation quality with significantly fewer inference steps compared to autoregressive approaches, highlighting the speed advantages of discrete diffusion.
- Overall, the results establish the model as a unified foundation for multimodal AI, capable of handling diverse modalities with high fidelity and efficiency.