
Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers

Zhengyao Lv, Tianlin Pan, Chenyang Si, Zhaoxi Chen, Wangmeng Zuo, Ziwei Liu, Kwan-Yee K. Wong
Published: 6/10/2025
Abstract

Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle to achieve precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT that hinder this alignment: 1) the suppression of cross-modal attention due to token imbalance between the visual and textual modalities, and 2) the lack of timestep-aware attention weighting. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models such as FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention for improving semantic fidelity in text-to-image diffusion models. Our code is publicly available at https://github.com/Vchitect/TACA
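The abstract describes the mechanism only at a high level. Below is a minimal PyTorch sketch of the idea, assuming TACA boosts the image-query-to-text-key attention logits by a constant temperature factor that is applied only at early (noisy) timesteps; the function name, the factor value, and the threshold schedule are illustrative assumptions, not the authors' released implementation:

```python
import torch

def taca_attention(q, k, v, num_text_tokens, t, gamma=1.3, t_threshold=0.5):
    """Joint attention over concatenated [text; image] tokens with a
    temperature boost on the cross-modal (image-query -> text-key) logits.

    q, k, v: tensors of shape (batch, heads, seq, dim), where the first
        `num_text_tokens` positions along `seq` are text tokens and the
        rest are image tokens (the MM-DiT joint-sequence layout).
    t: normalized diffusion timestep in [0, 1], with 1 = most noisy.

    NOTE: illustrative sketch only; gamma, the threshold schedule, and
    where exactly the scaling is applied are assumptions.
    """
    d = q.shape[-1]
    logits = (q @ k.transpose(-2, -1)) / d ** 0.5  # (B, H, S, S)

    # Timestep-dependent adjustment: amplify cross-modal attention only at
    # early (noisy) steps, where global semantics and layout are decided.
    if t >= t_threshold:
        # Scale only the image-query -> text-key block of the logit matrix,
        # the cross-modal interaction that token imbalance tends to suppress.
        logits[..., num_text_tokens:, :num_text_tokens] *= gamma

    attn = logits.softmax(dim=-1)
    return attn @ v

# Toy usage with FLUX-like proportions (77 text tokens vs. 1024 image tokens,
# illustrating the token imbalance the paper points to).
B, H, S_txt, S_img, D = 1, 8, 77, 1024, 64
q = torch.randn(B, H, S_txt + S_img, D)
k = torch.randn(B, H, S_txt + S_img, D)
v = torch.randn(B, H, S_txt + S_img, D)
out = taca_attention(q, k, v, num_text_tokens=S_txt, t=0.8)
print(out.shape)  # torch.Size([1, 8, 1101, 64])
```

In practice such a scaling would sit inside each MM-DiT joint-attention block, with the accompanying LoRA fine-tuning compensating for the resulting shift in the attention distribution.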