Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Lijiang Li Zuwei Long Yunhang Shen Heting Gao Haoyu Cao Xing Sun Caifeng Shan Ran He Chaoyou Fu

Abstract

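Recent multimodal large language models (MLLMs) have made remarkable progress, yet most still use a conventional autoregressive architecture as their backbone, leaving considerable room to explore effective and efficient alternatives in architectural design. Meanwhile, recent studies have successfully applied discrete diffusion models to domains such as visual understanding and image generation, demonstrating their potential as a promising backbone for multimodal systems. Inspired by this pioneering work, we propose Omni-Diffusion, the first any-to-any multimodal language model that unifies understanding and generation across text, speech, and images. Omni-Diffusion is built entirely on a mask-based discrete diffusion model and uses a single unified model of this kind to directly capture the joint distribution over discretized multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, suggesting that diffusion models hold great promise for powering the next generation of multimodal foundation models. Project page: https://omni-diffusion.github.io.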

One-sentence Summary

Researchers from Nanjing University, Tencent Youtu Lab, and CASIA introduce Omni-Diffusion, the first any-to-any multimodal model built on mask-based discrete diffusion. Unlike autoregressive backbones, this unified architecture captures joint distributions across text, speech, and images, matching or outperforming existing multimodal systems on understanding and generation benchmarks.

Key Contributions

  • Current multimodal systems rely heavily on autoregressive architectures, prompting the need for efficient alternatives that can unify understanding and generation across text, speech, and images.
  • Omni-Diffusion introduces the first any-to-any multimodal model built entirely on a mask-based discrete diffusion framework to directly capture the joint distribution of multimodal tokens in a shared semantic space.
  • Extensive experiments on diverse benchmarks demonstrate that this approach achieves performance comparable to or better than existing autoregressive systems while supporting complex multi-modal scenarios.

Introduction

Multimodal intelligence currently relies heavily on autoregressive large language models, which limits architectural diversity and often requires separate components to handle generation across different data types like text, images, and speech. While discrete diffusion models have shown promise in individual domains, prior work has struggled to unify them into a single backbone that natively supports any-to-any multimodal tasks without relying on auxiliary decoders or text-only foundations. The authors introduce Omni-Diffusion, the first any-to-any multimodal model built entirely on a mask-based discrete diffusion framework to learn the joint distribution of multimodal tokens. They leverage a three-stage progressive training pipeline and specialized inference techniques, such as attenuated tail-pad masking and position penalties, to achieve performance comparable to or better than existing autoregressive systems while enabling unified comprehension and generation across text, speech, and images.

Method

The Omni-Diffusion model is designed as a unified probabilistic framework that operates over a joint distribution of multimodal discrete tokens. Rather than relying on additional output models to project textual features from large language models into generated multimodal data, the authors directly model an intrinsically unified multimodal discrete representation space. This approach enables effective comprehension and generation of data across text, speech, and image modalities within a single architecture.

Model Architecture and Formulation

The core of the system is a mask-based discrete diffusion model built upon the pre-trained Dream-7B language model. To accommodate multimodal inputs, the vocabulary is expanded to include 16,384 speech tokens and 8,192 image tokens. The architecture employs distinct tokenizers for each modality. For images, the authors leverage MAGVIT-v2, which compresses images into discrete tokens with a downsampling factor of 16 and a codebook size of 8,192. For speech, SenseVoiceSmall is used for encoding, while the GLM-4-Voice decoder handles speech generation and tokenization at a rate of 12.5 Hz with a codebook size of 16,384.
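The sizes quoted above can be turned into a small arithmetic sketch. The block layout of the expanded vocabulary (text, then speech, then image IDs), the text vocabulary size of 150,000, and the 256x256 image resolution are illustrative assumptions, not details given in this summary; only the codebook sizes, the downsampling factor, and the 12.5 Hz rate come from the text.

```python
# Figures quoted in the text.
SPEECH_CODEBOOK = 16_384
IMAGE_CODEBOOK = 8_192
IMAGE_DOWNSAMPLE = 16
SPEECH_RATE_HZ = 12.5


def build_offsets(text_vocab_size):
    """Assumed layout: one ID space split into [text | speech | image] blocks."""
    speech_start = text_vocab_size
    image_start = speech_start + SPEECH_CODEBOOK
    total = image_start + IMAGE_CODEBOOK
    return {"speech": speech_start, "image": image_start, "total": total}


def to_unified_id(modality, code, offsets):
    """Map a modality-local codebook index to an ID in the shared vocabulary."""
    size = SPEECH_CODEBOOK if modality == "speech" else IMAGE_CODEBOOK
    if not 0 <= code < size:
        raise ValueError(f"code {code} outside {modality} codebook")
    return offsets[modality] + code


def image_token_count(height, width):
    """Tokens per image when each side is downsampled by a factor of 16."""
    return (height // IMAGE_DOWNSAMPLE) * (width // IMAGE_DOWNSAMPLE)


def speech_token_count(seconds):
    """Tokens for a clip of the given duration at 12.5 tokens per second."""
    return int(seconds * SPEECH_RATE_HZ)


# Hypothetical text vocabulary size, used only for illustration.
offsets = build_offsets(150_000)
print(offsets["total"])             # size of the expanded vocabulary
print(image_token_count(256, 256))  # tokens for an assumed 256x256 image
print(speech_token_count(10))       # tokens for 10 seconds of speech
```

Under these assumptions, a 256x256 image becomes a 16x16 grid of tokens and ten seconds of speech becomes 125 tokens, which gives a feel for how long the unified sequences are.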

In this architecture, text, image, and speech tokens are wrapped with special beginning and end tokens to form a unified sequence $x_{0} \in \mathbb{R}^{L}$. During training, the model corrupts this sequence by randomly replacing tokens with a special mask token at a ratio derived from the time step $t$, producing the corrupted sequence $x_{t}$. The model then predicts the clean token sequence $\hat{x}_{0}=p_{\theta}(x_{0}|x_{t})$. The training objective is the cross-entropy loss calculated only on the masked positions:
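A minimal sketch of the sequence construction and corruption step, in plain Python. The marker names (`<image_begin>` and so on) are hypothetical, and the mask probability is assumed to equal $t$ directly; the text only says the ratio is derived from $t$.

```python
import random

MASK = "[MASK]"


def wrap(tokens, modality):
    """Surround one modality's tokens with begin/end markers (names are hypothetical)."""
    return [f"<{modality}_begin>"] + list(tokens) + [f"<{modality}_end>"]


def corrupt(x0, t, rng):
    """Forward process: mask each token independently with probability t (assumed schedule)."""
    return [MASK if rng.random() < t else tok for tok in x0]


# One unified sequence made of a text span followed by an image span.
x0 = wrap(["a", "red", "car"], "text") + wrap(["img_17", "img_402"], "image")
xt = corrupt(x0, t=0.5, rng=random.Random(0))
print(xt)
```

With $t = 0$ the sequence is left untouched and with $t = 1$ every position is masked, so $t$ interpolates between clean data and a fully masked sequence.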

$$
L = - \mathbb{E}_{t,\, q(x_{t} | x_{0})} \left[ \sum_{i=1}^{L} \mathbb{I}\left[ x_{t}^{i} = [\mathrm{MASK}] \right] \log p_{\theta}(x_{0}^{i} | x_{t}) \right]
$$
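The term inside the expectation can be sketched for a single corrupted sequence as follows. The per-position dictionaries stand in for the model output $p_{\theta}(\cdot | x_{t})$ and are made-up values for illustration; a real implementation would operate on logits over the full vocabulary.

```python
import math

MASK = "[MASK]"


def masked_cross_entropy(x0, xt, probs):
    """Sum of -log p(x0[i] | xt) over the positions where xt[i] is the mask token.

    probs[i] maps candidate tokens to predicted probabilities at position i.
    """
    loss = 0.0
    for clean, noisy, dist in zip(x0, xt, probs):
        if noisy == MASK:  # indicator: only masked positions contribute
            loss -= math.log(dist[clean])
    return loss


x0 = ["a", "b", "c"]
xt = ["a", MASK, MASK]
probs = [{"a": 0.1}, {"b": 0.5}, {"c": 0.25}]
print(masked_cross_entropy(x0, xt, probs))
```

Position 0 is visible in `xt`, so its low predicted probability does not affect the loss; only the two masked positions are scored.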

The design utilizes full attention on all multimodal tokens, treating them uniformly within the sequence without modality-specific optimization during the core training process.

Training Strategy

To ensure stable training across distinct data distributions, the authors implement a three-stage progressive training pipeline. This strategy gradually extends the model's capabilities from visual-language alignment to full multimodal interaction.

The first stage focuses on Visual-Language Pre-Alignment, optimizing the model on text-to-image and image captioning tasks to align the visual modality with the semantic space of the language model. The second stage, Speech–Vision–Language Joint Alignment, retains the visual-text datasets while introducing automatic speech recognition and text-to-speech data to facilitate speech-text alignment. The final stage optimizes the model on the constructed Speech-Driven Visual Interaction (SDVI) dataset, which includes spoken visual question answering and speech-to-image generation tasks. This stage further enhances the unified alignment across all modalities. Additionally, an attenuated tail-pad masking strategy is employed to prevent overfitting to pad tokens during variable-length generation.
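The three stages can be written down as a small configuration sketch. The task identifiers are hypothetical labels for the tasks named above, and the summary does not say whether the final stage also mixes in data from earlier stages, so it is listed here with the SDVI tasks only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    tasks: tuple


# Task names are illustrative labels, not identifiers from the authors' code.
STAGES = (
    Stage("Visual-Language Pre-Alignment",
          ("text_to_image", "image_captioning")),
    Stage("Speech-Vision-Language Joint Alignment",
          ("text_to_image", "image_captioning", "asr", "tts")),
    Stage("Speech-Driven Visual Interaction",
          ("spoken_vqa", "speech_to_image")),
)


def new_tasks(stages):
    """For each stage, list the tasks that were not seen in any earlier stage."""
    seen = set()
    out = []
    for stage in stages:
        fresh = tuple(t for t in stage.tasks if t not in seen)
        seen.update(stage.tasks)
        out.append((stage.name, fresh))
    return out


for name, fresh in new_tasks(STAGES):
    print(f"{name}: adds {', '.join(fresh)}")
```

Listing only the newly added tasks per stage makes the progression visible: vision first, then speech alongside the retained vision data, then tasks that combine speech and vision.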

Overall Framework

The resulting system functions as an any-to-any multimodal framework capable of handling diverse tasks.

This framework supports Speech Tasks such as ASR and TTS, Visual Tasks like captioning and visual QA, and complex Speech-Driven Visual Interaction tasks including speech-to-image generation and spoken visual understanding. By unifying these modalities, the model achieves effective comprehension and generation across text, image, and speech domains.

Experiment

  • Main benchmarks evaluate speech recognition, text-to-speech, visual question answering, and text-to-image generation, confirming that the model matches or exceeds specialized and any-to-any baselines in both understanding and generation tasks.
  • Speech-to-image experiments validate strong cross-modal alignment, demonstrating that the model produces consistent visual outputs whether conditioned on text or synthesized speech.
  • Qualitative examples illustrate the model's ability to generate diverse, high-quality images with fine details and to perform image inpainting without additional fine-tuning, leveraging its mask-token-prediction mechanism.
  • Sampling efficiency tests show that the model maintains high generation quality with significantly fewer inference steps compared to autoregressive approaches, highlighting the speed advantages of discrete diffusion.
  • Overall, the results establish the model as a unified foundation for multimodal AI, capable of handling diverse modalities with high fidelity and efficiency.
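The inpainting and few-step sampling behaviour described above can be illustrated with a generic confidence-based unmasking loop. This is a sketch of one common decoding scheme for masked discrete diffusion, not the authors' sampler (which, per the introduction, also uses techniques such as position penalties); `toy_predict` is a hypothetical stand-in for the model.

```python
import math

MASK = "[MASK]"


def decode(seq, predict, steps):
    """Fill the MASK positions of `seq` using at most `steps` calls to `predict`.

    predict(seq) returns one (token, confidence) pair per position.
    Positions that are not masked are never modified, which is what
    makes inpainting possible without extra training.
    """
    seq = list(seq)
    calls = 0
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        preds = predict(seq)
        calls += 1
        # Commit an equal share of the remaining masked positions per step.
        k = math.ceil(len(masked) / (steps - step))
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            seq[i] = preds[i][0]
    return seq, calls


def toy_predict(seq):
    """Hypothetical model: proposes token 't<i>' at position i, more confident near the start."""
    return [(f"t{i}", 1.0 / (i + 1)) for i in range(len(seq))]


# Inpainting-style call: the first token is given, the other eight are masked.
filled, calls = decode(["keep"] + [MASK] * 8, toy_predict, steps=4)
print(filled, calls)
```

In this toy run, eight tokens are produced with four model calls, whereas token-by-token autoregressive decoding would need one call per token. The given token is left in place because only masked positions are ever written.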
