
MMaDA: Multimodal Large Diffusion Language Model

1. Tutorial Introduction


MMaDA-8B-Base is a multimodal diffusion large language model jointly developed by Princeton University, the ByteDance Seed team, Peking University, and Tsinghua University, released on May 23, 2025. It is the first systematic exploration of the diffusion architecture as a unified paradigm for multimodal foundation models, aiming at general capabilities across modalities by deeply integrating text reasoning, multimodal understanding, and image generation. The related paper is "MMaDA: Multimodal Large Diffusion Language Models".

This tutorial runs on a single A6000 GPU, and the deployed model is MMaDA-8B-Base. Three demos are provided for testing: Text Generation, Multimodal Understanding, and Text-to-Image Generation.

2. Effect Demonstration

MMaDA decoding demo

This video shows how text and images are generated with the diffusion foundation model: the Text Generation part uses semi-autoregressive sampling, while the multimodal generation part uses non-autoregressive diffusion denoising.

3. Operation Steps

1. Start the Container

If "Bad Gateway" is displayed, it means the model is initializing. Since the model is large, please wait about 2-3 minutes and refresh the page.

2. Usage Steps

1. Text Generation

Specific parameters:

  • Prompt: the input text.
  • Generation Length: the number of tokens to generate.
  • Total Sampling Steps: must be divisible by (gen_length / block_length), i.e., by the number of blocks (see the sanity check after this list).
  • Block Length: gen_length must be divisible by this value.
  • Remasking Strategy: which predicted tokens are re-masked at each step (e.g., the lowest-confidence ones, or random ones).
  • CFG Scale: classifier-free guidance scale; 0 disables guidance.
  • Temperature: controls sampling randomness via Gumbel noise; 0 is deterministic.
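
As a quick sanity check on the two divisibility constraints above (the values are arbitrary examples, not the demo's defaults):

```python
gen_length, block_length, steps = 128, 32, 64   # example values only

assert gen_length % block_length == 0, "Block Length must divide Generation Length"
num_blocks = gen_length // block_length
assert steps % num_blocks == 0, "Total Sampling Steps must divide by the block count"
print(f"{num_blocks} blocks, {steps // num_blocks} sampling steps per block")
```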

Result

2. Multimodal Understanding

Specific parameters:

  • Prompt: the input text.
  • Generation Length: the number of tokens to generate.
  • Total Sampling Steps: must be divisible by (gen_length / block_length), i.e., by the number of blocks.
  • Block Length: gen_length must be divisible by this value.
  • Remasking Strategy: which predicted tokens are re-masked at each step (e.g., the lowest-confidence ones, or random ones).
  • CFG Scale: classifier-free guidance scale; 0 disables guidance (a sketch of the guidance mix follows this list).
  • Temperature: controls sampling randomness via Gumbel noise; 0 is deterministic.
  • Image: the input image to be understood.
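
For intuition on the CFG Scale parameter, here is one common formulation of classifier-free guidance on logits, written so that a scale of 0 reduces to the plain conditional prediction. This is an illustrative convention, not necessarily MMaDA's exact code; `cond` and `uncond` are hypothetical names for the prompt-conditioned and unconditioned logits.

```python
import torch

def apply_cfg(cond: torch.Tensor, uncond: torch.Tensor, scale: float) -> torch.Tensor:
    # scale = 0 returns exactly `cond`, matching "0 disables guidance" above;
    # larger scales push predictions further toward the prompt-conditioned ones.
    return uncond + (1.0 + scale) * (cond - uncond)

cond, uncond = torch.randn(4), torch.randn(4)
assert torch.allclose(apply_cfg(cond, uncond, 0.0), cond)
```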

Result

3. Text-to-Image Generation

Specific parameters:

  • Prompt: the input text.
  • Total Sampling Steps: the number of denoising steps; must be divisible by (gen_length / block_length).
  • Guidance Scale: classifier-free guidance scale; 0 disables guidance.
  • Scheduler: the mask schedule that controls how quickly masked image tokens are revealed across the denoising steps (sketched after this list):
    • cosine: reveals few tokens in early steps and more toward the end.
    • sigmoid: S-shaped schedule that reveals most tokens in the middle steps.
    • linear: reveals tokens at a constant rate.
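
The three scheduler options most plausibly correspond to MaskGIT-style mask schedules, which map sampling progress t in [0, 1] to the fraction of image tokens still masked. The exact curves below are an assumption for illustration, not MMaDA's verified internals.

```python
import math

def cosine(t: float) -> float:   # slow reveal early, fast near the end
    return math.cos(0.5 * math.pi * t)

def sigmoid(t: float) -> float:  # S-shaped: most tokens revealed mid-way
    return 1.0 / (1.0 + math.exp(10.0 * (t - 0.5)))

def linear(t: float) -> float:   # constant reveal rate
    return 1.0 - t

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"t={t:.2f}  masked fraction: cosine={cosine(t):.2f} "
          f"sigmoid={sigmoid(t):.2f} linear={linear(t):.2f}")
```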

Result

4. Discussion

🖌️ If you come across a high-quality project, please leave a message in the background to recommend it! We have also set up a tutorial exchange group; scan the QR code and add the note [SD Tutorial] to join the group, discuss technical issues, and share application results ↓

Citation Information

Thanks to GitHub user SuperYang for deploying this tutorial. The citation information for this project is as follows:

@article{yang2025mmada,
  title={MMaDA: Multimodal Large Diffusion Language Models},
  author={Yang, Ling and Tian, Ye and Li, Bowen and Zhang, Xinchen and Shen, Ke and Tong, Yunhai and Wang, Mengdi},
  journal={arXiv preprint arXiv:2505.15809},
  year={2025}
}