HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation

Teng Hu, Zhentao Yu, Zhengguang Zhou, Sen Liang, Yuan Zhou, Qin Lin, Qinglin Lu
Published: 5/12/2025
Abstract

Customized video generation aims to produce videos featuring specific subjects under flexible, user-defined conditions, yet existing methods often struggle with identity consistency and limited input modalities. In this paper, we propose HunyuanCustom, a multi-modal customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. Built upon HunyuanVideo, our model first addresses the image-text conditioned generation task by introducing a text-image fusion module based on LLaVA for enhanced multi-modal understanding, along with an image ID enhancement module that leverages temporal concatenation to reinforce identity features across frames. To enable audio- and video-conditioned generation, we further propose modality-specific condition injection mechanisms: an AudioNet module that achieves hierarchical alignment via spatial cross-attention, and a video-driven injection module that integrates latent-compressed conditional video through a patchify-based feature-alignment network. Extensive experiments on single- and multi-subject scenarios demonstrate that HunyuanCustom significantly outperforms state-of-the-art open- and closed-source methods in terms of ID consistency, realism, and text-video alignment. Moreover, we validate its robustness across downstream tasks, including audio- and video-driven customized video generation. Our results highlight the effectiveness of multi-modal conditioning and identity-preserving strategies in advancing controllable video generation. All code and models are available at https://hunyuancustom.github.io.
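The image ID enhancement module described above reinforces identity by concatenating the reference-image latent with the video latents along the temporal axis, so identity features are available when attending over every frame. A minimal sketch of that idea, with illustrative shapes and function names that are our assumptions rather than HunyuanCustom's actual API:

```python
import numpy as np

def temporal_concat(video_latent: np.ndarray, image_latent: np.ndarray) -> np.ndarray:
    """Prepend a reference-image latent to the video latents along the
    time axis (hypothetical sketch of temporal concatenation; the real
    model operates on learned latents, not raw arrays).

    video_latent: (T, C, H, W) latent frames
    image_latent: (C, H, W) single reference-image latent
    """
    # image_latent[None] adds a time dimension of length 1 so both
    # arrays can be joined along axis 0 (the temporal axis).
    return np.concatenate([image_latent[None], video_latent], axis=0)

video = np.zeros((16, 4, 8, 8))   # 16 latent frames (illustrative sizes)
image = np.ones((4, 8, 8))        # one reference-image latent
combined = temporal_concat(video, image)
print(combined.shape)  # (17, 4, 8, 8)
```

After concatenation, the downstream attention layers can treat the reference latent as an extra "frame," which is one simple way identity features can propagate across the sequence.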