Qwen-Image Technical Report

We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model's native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages like Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to strike a balance between preserving semantic consistency and maintaining visual fidelity. Qwen-Image achieves state-of-the-art performance, demonstrating its strong capabilities in both image generation and editing across multiple benchmarks.
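To make the dual-encoding mechanism concrete, the sketch below shows one plausible wiring of the two streams: the original image is encoded once for semantics (standing in for Qwen2.5-VL) and once for reconstruction (standing in for the VAE encoder), and both streams condition the editing backbone. This is a minimal illustration under assumed shapes and stand-in modules; the class `DualEncodingEditor`, all dimensions, and the token-concatenation strategy are our assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class DualEncodingEditor(nn.Module):
    """Minimal sketch: feed the same image through a semantic encoder
    (stand-in for Qwen2.5-VL) and a reconstructive encoder (stand-in for
    the VAE), then merge both streams into conditioning tokens that an
    editing backbone such as MMDiT could attend over."""

    def __init__(self, sem_dim: int = 1024, vae_dim: int = 16, hidden: int = 512):
        super().__init__()
        # Hypothetical placeholders for the two pretrained encoders.
        self.semantic_encoder = nn.Sequential(      # stands in for Qwen2.5-VL
            nn.Flatten(), nn.LazyLinear(sem_dim)
        )
        self.vae_encoder = nn.Conv2d(3, vae_dim, kernel_size=8, stride=8)  # stands in for the VAE
        # Project both streams into a shared conditioning space.
        self.sem_proj = nn.Linear(sem_dim, hidden)
        self.vae_proj = nn.Linear(vae_dim, hidden)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # Semantic representation: global, content-level features
        # that preserve *what* is in the image.
        sem = self.sem_proj(self.semantic_encoder(image))   # (B, hidden)
        # Reconstructive representation: spatial latents that
        # preserve *how* the image looks (visual fidelity).
        lat = self.vae_encoder(image)                       # (B, vae_dim, h, w)
        lat = lat.flatten(2).transpose(1, 2)                # (B, h*w, vae_dim)
        rec = self.vae_proj(lat)                            # (B, h*w, hidden)
        # One semantic token concatenated with the reconstructive tokens;
        # the editing module balances the two when generating the edit.
        return torch.cat([sem.unsqueeze(1), rec], dim=1)    # (B, 1 + h*w, hidden)


if __name__ == "__main__":
    tokens = DualEncodingEditor()(torch.randn(2, 3, 256, 256))
    print(tokens.shape)  # torch.Size([2, 1025, 512])
```

In this reading, the semantic stream anchors editing consistency (the edit should not change what the scene means) while the VAE stream anchors fidelity (unedited regions should survive pixel-faithfully), which is the balance the abstract attributes to dual encoding.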