Wan2.2-S2V-14B: Film-Grade Audio-Driven Video Generation
1. Tutorial Introduction

Wan2.2-S2V-14B is an audio-driven video generation model open-sourced by Alibaba's Tongyi Wanxiang (Wan) team in August 2025. Given only a static image and an audio clip, it can generate film-quality digital-human videos up to several minutes long, and it supports a variety of image types and framings. Users can further control the video content by entering text prompts to enrich the generated footage. The model integrates multiple innovative techniques to achieve audio-driven video generation in complex scenes, and supports long-video generation as well as multi-resolution training and inference. It has been widely applied in digital-human livestreaming, film and television production, AI education, and other fields. The associated paper is "Wan-S2V: Audio-Driven Cinematic Video Generation".
This tutorial uses a single RTX A6000 GPU as its compute resource.
2. Effect Demonstration

3. Operation steps
1. Start the container

2. Usage steps
If "Bad Gateway" is displayed, it means the model is initializing. Since the model is large, please wait about 2-3 minutes and refresh the page.
Note: The more inference steps, the better the generated effect, but the longer the inference generation time will be. Please set the inference steps reasonably (Example 1: When the inference steps are 10, it takes about 15 minutes to generate a video).
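As a rough rule of thumb, generation time grows approximately in proportion to the number of sampling steps. The helper below is a minimal sketch (not part of the project code) that extrapolates from the ~15 minutes / 10 steps reference point above, assuming linear scaling.

```python
# Hypothetical helper: rough runtime estimate, assuming generation time
# scales roughly linearly with the number of sampling steps.
# The 15-minute / 10-step reference point comes from the example above.

def estimate_minutes(steps: int, ref_steps: int = 10, ref_minutes: float = 15.0) -> float:
    """Estimate generation time in minutes for a given number of sampling steps."""
    return ref_minutes * steps / ref_steps

if __name__ == "__main__":
    for steps in (10, 20, 40):
        print(f"{steps} steps -> ~{estimate_minutes(steps):.0f} minutes")
```

Under this assumption, 20 steps would take about 30 minutes and 40 steps about an hour; actual times also depend on resolution, video length, and hardware.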


Specific parameters (illustrated in the sketch after this list):
- Resolution (H*W): the height and width of the generated video, in pixels.
- Number of frames per segment: the number of consecutive frames processed or generated in each segment of the video.
- Guidance coefficient: controls how strongly the generation process follows the input prompt or conditions (such as the text or the reference image).
- Number of sampling steps: the number of denoising iterations in the diffusion process. Diffusion models start from pure noise and apply multiple denoising steps to reach the final result; more steps usually mean better quality but longer generation time.
- Noise shift: adjusts the characteristics of the noise during diffusion, such as its distribution or intensity.
- Random seed (-1 = random): controls the initial state of the random number generator; the same seed reproduces the same result.
- Use the reference image as the first frame: a Boolean option; if enabled, the user-provided reference image is used as the starting (first) frame of the generated video.
- Model offloading to save video memory (slower): offloads model weights from GPU to CPU memory to reduce VRAM usage, at the cost of slower inference.
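To make the role of each parameter concrete, here is a minimal sketch that gathers them into one configuration and hands them to a generation call. The function generate_s2v, the field names, and all default values are illustrative assumptions for this tutorial, not the project's actual interface; consult the Wan2.2 repository for the real entry point.

```python
# Illustrative sketch only: generate_s2v and all field names/defaults are
# hypothetical stand-ins; consult the Wan2.2 repository for the real interface.
from dataclasses import dataclass

@dataclass
class S2VConfig:
    height: int = 1024                     # Resolution (H), example value
    width: int = 576                       # Resolution (W), example value
    frames_per_segment: int = 80           # consecutive frames generated per segment
    guidance_scale: float = 4.5            # how strongly generation follows the prompt/conditions
    sampling_steps: int = 10               # denoising iterations; more = better quality, slower
    noise_shift: float = 5.0               # adjusts noise distribution/intensity during diffusion
    seed: int = -1                         # -1 draws a random seed
    ref_image_as_first_frame: bool = True  # use the reference image as the video's first frame
    offload_model: bool = True             # offload weights to CPU to save VRAM (slower)

def generate_s2v(image_path: str, audio_path: str, prompt: str, cfg: S2VConfig) -> str:
    """Hypothetical wrapper around the model's inference code; returns the output video path."""
    print(f"Generating video from {image_path} + {audio_path} with prompt {prompt!r}")
    print(cfg)
    # ... call the actual Wan2.2-S2V inference here ...
    return "output.mp4"

if __name__ == "__main__":
    cfg = S2VConfig(sampling_steps=10, seed=42)
    generate_s2v("reference.png", "speech.wav", "a person speaking in a studio", cfg)
```

In the web UI, these fields correspond one-to-one to the controls listed above.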
4. Discussion
🖌️ If you come across a high-quality project, please leave a message to recommend it! We have also set up a tutorial exchange group; scan the QR code and note [SD Tutorial] to join the group, discuss technical issues, and share results↓

Citation Information
The citation information for this project is as follows:
@article{wan2025,
  title={Wan: Open and Advanced Large-Scale Video Generative Models},
  author={Team Wan and Ang Wang and Baole Ai and Bin Wen and Chaojie Mao and Chen-Wei Xie and Di Chen and Feiwu Yu and Haiming Zhao and Jianxiao Yang and Jianyuan Zeng and Jiayu Wang and Jingfeng Zhang and Jingren Zhou and Jinkai Wang and Jixuan Chen and Kai Zhu and Kang Zhao and Keyu Yan and Lianghua Huang and Mengyang Feng and Ningyi Zhang and Pandeng Li and Pingyu Wu and Ruihang Chu and Ruili Feng and Shiwei Zhang and Siyang Sun and Tao Fang and Tianxing Wang and Tianyi Gui and Tingyu Weng and Tong Shen and Wei Lin and Wei Wang and Wei Wang and Wenmeng Zhou and Wente Wang and Wenting Shen and Wenyuan Yu and Xianzhong Shi and Xiaoming Huang and Xin Xu and Yan Kou and Yangyu Lv and Yifei Li and Yijing Liu and Yiming Wang and Yingya Zhang and Yitong Huang and Yong Li and You Wu and Yu Liu and Yulin Pan and Yun Zheng and Yuntao Hong and Yupeng Shi and Yutong Feng and Zeyinzi Jiang and Zhen Han and Zhi-Fan Wu and Ziyu Liu},
  journal={arXiv preprint arXiv:2503.20314},
  year={2025}
}