Wan2.2: An Open-Source, Advanced, Large-Scale Video Generation Model
1. Tutorial Introduction

Wan2.2 is an advanced AI video generation model open-sourced by Alibaba's Tongyi Wanxiang Laboratory on July 28, 2025. Three models were released: text-to-video (Wan2.2-T2V-A14B), image-to-video (Wan2.2-I2V-A14B), and unified text/image-to-video (Wan2.2-TI2V-5B), with a total of 27 billion parameters. Wan2.2 introduces a mixture-of-experts (MoE) architecture to video generation for the first time, improving both generation quality and computational efficiency. It also pioneers a cinematic aesthetic control system that can precisely control effects such as lighting, shadow, color, and composition. The compact 5B model used in this tutorial supports both text-to-video and image-to-video generation, runs on consumer-grade graphics cards, and is built on an efficient 3D VAE whose high compression ratio enables fast generation of high-definition video. The related paper is "Wan: Open and Advanced Large-Scale Video Generative Models".
This tutorial uses a single RTX A6000 GPU as the computing resource and deploys the Wan2.2-TI2V-5B model. Two examples, text-to-video generation and image-to-video generation, are provided for testing.
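For readers who want to try the checkpoint outside the prebuilt container, the snippet below is a minimal loading sketch using the Hugging Face Diffusers integration. The repository id `Wan-AI/Wan2.2-TI2V-5B-Diffusers`, the `WanPipeline`/`AutoencoderKLWan` classes, and the dtype choices are assumptions about the Diffusers packaging of the checkpoint, not part of this tutorial's container.

```python
# Minimal loading sketch, assuming a recent diffusers release with Wan support
# and a Diffusers-format checkpoint at "Wan-AI/Wan2.2-TI2V-5B-Diffusers".
import torch
from diffusers import AutoencoderKLWan, WanPipeline

model_id = "Wan-AI/Wan2.2-TI2V-5B-Diffusers"  # assumed repository id

# The 3D VAE is commonly kept in float32 for numerical stability,
# while the diffusion transformer runs in bfloat16 to save VRAM.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)

# On consumer-grade GPUs, CPU offloading trades generation speed for lower peak VRAM;
# on a 48 GB card such as the RTX A6000 used here, pipe.to("cuda") should also fit.
pipe.enable_model_cpu_offload()
```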
2. Results Showcase
Text-to-video

Image-to-video

3. Operation Steps
1. Start the container

2. Usage steps
If "Bad Gateway" is displayed, it means the model is initializing. Since the model is large, please wait about 2-3 minutes and refresh the page.
1. Text-to-Video Generation
Specific parameters (a hedged mapping of these controls onto a pipeline call is sketched after the list):
- Prompt: The text describing the video content you want to generate.
- Duration: Specify the desired video duration (in seconds).
- Output Resolution: Select the resolution (width x height) of the generated video.
- Sampling Steps: Controls the number of iterative optimizations during video generation (the number of denoising steps for the diffusion model).
- Guidance Scale: Controls how closely the model follows the user's prompt.
- Sample Shift: A sampler-dependent parameter that adjusts the noise-schedule shift used during sampling.
- Seed: Controls the randomness of the generation process.
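To make these controls concrete, the sketch below maps them onto a Diffusers text-to-video call. It is a hedged illustration: the repository id, the assumed 24 fps output rate, the `duration * fps + 1` frame-count formula, and the use of `UniPCMultistepScheduler` with `flow_shift` for the "Sample Shift" control are assumptions, not values read from this tutorial's web UI.

```python
# Sketch: mapping the UI controls onto a Diffusers text-to-video call.
# Repository id, fps, and the frame-count formula are illustrative assumptions.
import torch
from diffusers import AutoencoderKLWan, WanPipeline, UniPCMultistepScheduler
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.2-TI2V-5B-Diffusers"  # assumed repository id
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16).to("cuda")

# "Sample Shift" -> flow_shift on the scheduler (UniPC assumed here).
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=5.0)

fps = 24                           # assumed output frame rate
duration_s = 5                     # "Duration" (seconds)
num_frames = duration_s * fps + 1  # assumed mapping from duration to frame count

video = pipe(
    prompt="A corgi surfing a small wave at sunset, cinematic lighting",  # "Prompt"
    height=704,                    # "Output Resolution" (height)
    width=1280,                    # "Output Resolution" (width)
    num_frames=num_frames,         # "Duration"
    num_inference_steps=50,        # "Sampling Steps"
    guidance_scale=5.0,            # "Guidance Scale"
    generator=torch.Generator(device="cuda").manual_seed(42),  # "Seed"
).frames[0]

export_to_video(video, "t2v_output.mp4", fps=fps)
```

Fixing the seed while keeping all other settings identical reproduces the same video, which is useful when iterating on prompts.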

2. Image-to-Video Generation
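Image-to-video works the same way except that a still image conditions the first frame. The sketch below is an assumed Diffusers variant using `WanImageToVideoPipeline`; whether the 5B TI2V checkpoint can be driven through this class (as opposed to the dedicated Wan2.2-I2V-A14B checkpoints) is an assumption of the sketch, so adjust the repository id to whichever image-to-video checkpoint you actually deploy.

```python
# Sketch: image-to-video, with an input image conditioning the first frame.
# The repository id and the pipeline class choice are illustrative assumptions.
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

model_id = "Wan-AI/Wan2.2-TI2V-5B-Diffusers"  # assumed repository id
pipe = WanImageToVideoPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

image = load_image("input.jpg")  # the still image to animate (hypothetical path)

video = pipe(
    image=image,
    prompt="The camera slowly pushes in while leaves drift across the scene",
    height=704,
    width=1280,
    num_frames=121,                # about 5 s at an assumed 24 fps
    num_inference_steps=50,
    guidance_scale=5.0,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "i2v_output.mp4", fps=24)
```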

4. Discussion
🖌️ If you come across a high-quality project, please leave us a message to recommend it! We have also set up a tutorial exchange group; scan the QR code and add the note [SD Tutorial] to join, discuss technical issues, and share your results↓

Citation Information
The citation information for this project is as follows:
@article{wan2025,
title={Wan: Open and Advanced Large-Scale Video Generative Models},
author={Team Wan and Ang Wang and Baole Ai and Bin Wen and Chaojie Mao and Chen-Wei Xie and Di Chen and Feiwu Yu and Haiming Zhao and Jianxiao Yang and Jianyuan Zeng and Jiayu Wang and Jingfeng Zhang and Jingren Zhou and Jinkai Wang and Jixuan Chen and Kai Zhu and Kang Zhao and Keyu Yan and Lianghua Huang and Mengyang Feng and Ningyi Zhang and Pandeng Li and Pingyu Wu and Ruihang Chu and Ruili Feng and Shiwei Zhang and Siyang Sun and Tao Fang and Tianxing Wang and Tianyi Gui and Tingyu Weng and Tong Shen and Wei Lin and Wei Wang and Wei Wang and Wenmeng Zhou and Wente Wang and Wenting Shen and Wenyuan Yu and Xianzhong Shi and Xiaoming Huang and Xin Xu and Yan Kou and Yangyu Lv and Yifei Li and Yijing Liu and Yiming Wang and Yingya Zhang and Yitong Huang and Yong Li and You Wu and Yu Liu and Yulin Pan and Yun Zheng and Yuntao Hong and Yupeng Shi and Yutong Feng and Zeyinzi Jiang and Zhen Han and Zhi-Fan Wu and Ziyu Liu},
journal={arXiv preprint arXiv:2503.20314},
year={2025}
}