MMaDA: Multimodal Large Diffusion Language Models

Yang, Ling ; Tian, Ye ; Li, Bowen ; Zhang, Xinchen ; Shen, Ke ; Tong, Yunhai ; Wang, Mengdi
Publication date: 5/22/2025
Abstract

We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA
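
The abstract does not spell out how UniGRPO computes its policy-gradient signal. As a rough illustration of the GRPO family that the name points to, the sketch below shows group-relative advantage normalization: several responses are sampled for one prompt, each is scored by a reward function, and each reward is standardized against the group's mean and standard deviation. This is an assumption about the general technique, not MMaDA's actual implementation; the function name `group_relative_advantages` and the `eps` parameter are hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples for the same prompt.

    Hypothetical helper illustrating GRPO-style advantages; it is not taken
    from the MMaDA code base.
    """
    mean = statistics.fmean(rewards)
    # Population standard deviation over the group; eps avoids division by
    # zero when every sample in the group receives the same reward.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Example: four sampled responses, two judged correct (reward 1.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Samples that score above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form a baseline. How UniGRPO adapts this idea to diffusion models is described in the paper and repository rather than in the abstract.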