HyperAI

Tencent HunyuanDiT Text-to-Image Demo

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer and Fine-Grained Chinese Understanding Model

This model is the first text-to-image generation model built on a Chinese-English bilingual DiT (Diffusion Transformer) architecture, with fine-grained understanding of both Chinese and English. To build the hybrid DiT, the research team carefully redesigned the transformer structure, the text encoder, and the positional encoding. They also built a complete data pipeline for updating and evaluating data to support iterative model optimization. To achieve fine-grained text understanding, the project trained a multimodal large language model (MLLM) to refine the text descriptions of images. As a result, Hunyuan-DiT can hold multi-turn conversations with users, generating and refining images based on the context.

🎉 Hunyuan-DiT Main Features

Hunyuan-DiT is a diffusion model operating in latent space, as shown in the figure below. Following the latent diffusion framework, a pre-trained variational autoencoder (VAE) compresses images into a low-dimensional latent space, and a diffusion model is trained there to learn the data distribution. The diffusion model is parameterized by a transformer. To encode text prompts, the model combines a pre-trained bilingual (Chinese and English) CLIP encoder with a multilingual T5 encoder.
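The dual-encoder conditioning described above can be sketched as follows. This is an illustrative toy, not the real model: the tensor shapes and hidden size below are assumptions, and the projections stand in for whatever learned layers the model actually uses. The point is only how features from two text encoders of different widths can be projected to a shared dimension and concatenated into one token stream for the DiT's cross-attention.

```python
import numpy as np

# Toy sketch of combining bilingual CLIP and multilingual T5 token
# features into one conditioning sequence. All shapes are assumptions
# for illustration, not the actual Hunyuan-DiT configuration.

rng = np.random.default_rng(0)
d_model = 1024                                  # assumed DiT hidden size

clip_feats = rng.standard_normal((77, 768))     # assumed CLIP token features
t5_feats = rng.standard_normal((256, 2048))     # assumed T5 token features

# Project each encoder's output to the shared model width
w_clip = rng.standard_normal((768, d_model)) / np.sqrt(768)
w_t5 = rng.standard_normal((2048, d_model)) / np.sqrt(2048)

# Concatenate along the sequence axis: one token stream conditions the DiT
text_context = np.concatenate([clip_feats @ w_clip, t5_feats @ w_t5], axis=0)
print(text_context.shape)  # (333, 1024)
```

In practice the concatenated sequence is what the transformer blocks cross-attend to, so the model sees CLIP and T5 tokens as a single conditioning context.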


Multi-turn text-to-image generation

Understanding natural language instructions and interacting with users over multiple turns is important for an AI system. A text-to-image system built this way supports a dynamic, iterative creative process that turns users' ideas into reality step by step. This section details how Hunyuan-DiT is given the ability to carry out multi-turn dialogue and image generation: an MLLM is trained to understand the multi-turn user dialogue and output the new text prompt used for image generation.
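The loop described above can be sketched conceptually. This is not the official API: `rewrite_prompt` below is a trivial stand-in for the trained MLLM, which in the real system reads the full conversation and emits a single refined generation prompt for the DiT.

```python
# Conceptual sketch of a multi-turn text-to-image loop.
# `rewrite_prompt` is a toy stand-in for the MLLM that condenses
# the dialogue history into one prompt per turn.

def rewrite_prompt(history):
    """Toy MLLM substitute: merge all user turns into a single prompt."""
    return ", ".join(history)

def multi_turn_session(user_turns):
    history = []
    prompts = []
    for turn in user_turns:
        history.append(turn)                 # accumulate the dialogue context
        prompt = rewrite_prompt(history)     # MLLM sees the full context
        prompts.append(prompt)               # this prompt drives the DiT
    return prompts

prompts = multi_turn_session([
    "a lake at sunset",
    "add a small wooden boat",
    "make the style an ink-wash painting",
])
print(prompts[-1])
```

Each turn's prompt carries the whole conversation forward, which is what lets the system refine, rather than restart, the image across turns.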


Model generation performance

  • Long text input

📈 Comparison with existing models

To comprehensively compare the generation capabilities of Hunyuan-DiT and other models, the research team constructed a four-dimensional test set and invited more than 50 professional evaluators to assess performance on text-image consistency, exclusion of AI artifacts, subject clarity, and aesthetics.

| Model | Open Source | Text-Image Consistency (%) | Excluding AI Artifacts (%) | Subject Clarity (%) | Aesthetics (%) | Overall (%) |
| --- | --- | --- | --- | --- | --- | --- |
| SDXL | ✔ | 64.3 | 60.6 | 91.1 | 76.3 | 42.7 |
| PixArt-α | ✔ | 68.3 | 60.9 | 93.2 | 77.5 | 45.5 |
| Playground 2.5 | ✔ | 71.9 | 70.8 | 94.9 | 83.3 | 54.3 |
| SD 3 | ✘ | 77.1 | 69.3 | 94.6 | 82.5 | 56.7 |
| MidJourney v6 | ✘ | 73.5 | 80.2 | 93.5 | 87.2 | 63.3 |
| DALL-E 3 | ✘ | 83.9 | 80.3 | 96.5 | 89.4 | 71.0 |
| Hunyuan-DiT | ✔ | 74.2 | 74.3 | 95.4 | 86.6 | 59.0 |

Tutorial Usage

1. Clone and start the container

  • [Note] Since the model is large, it may take about 2 to 3 minutes after the container starts for the model to finish loading before it can be used.

2. User interface

The larger the number of sampling steps, the better the generation quality, but the longer the generation time. With the default number of sampling steps, generation takes about one minute.
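For users running the model outside this demo's web UI, the same sampling-steps trade-off can be controlled from the command line. The sketch below assumes the inference script and flag names from the official Hunyuan-DiT repository (`sample_t2i.py`, `--infer-steps`); verify them against the version of the repository you have checked out.

```shell
# Fewer steps: faster generation, typically lower quality
python sample_t2i.py --prompt "a lake at sunset" --infer-steps 30

# More steps: slower generation, typically higher quality
python sample_t2i.py --prompt "a lake at sunset" --infer-steps 100
```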