Pusa-VidGen Video Generation Model Demo
1. Tutorial Introduction

Pusa V1 is an efficient multimodal video generation model released by the Yaofang-Liu team on July 25, 2025. Built on Vectorized Timestep Adaptation (VTA), it tackles the core problems of traditional video generation models: high training cost, low inference efficiency, and poor temporal consistency. Unlike conventional approaches that depend on massive data and compute, Pusa V1 achieves its gains through lightweight fine-tuning of Wan2.1-T2V-14B. Training cost is only about $500 (roughly 1/200 that of comparable models), the dataset contains only 4K samples (roughly 1/2500 that of comparable models), and training completes on eight 80 GB GPUs, which dramatically lowers the barrier to entry for video generation.
The model is also a strong multitasker: beyond text-to-video (T2V) and image-to-video (I2V), it handles zero-shot tasks such as video completion, start-and-end-frame generation, and cross-scene transitions without any scene-specific training. Generation quality is equally notable: with a few-step inference strategy it can surpass the baseline model in as few as 10 steps, and it reaches a total VBench-I2V score of 87.32%, performing especially well on dynamic detail (such as limb movement and lighting changes) and temporal coherence. The non-destructive adaptation enabled by VTA injects temporal dynamics into the base model while preserving its original image generation quality, a genuine "1+1>2" effect. At deployment time, inference latency is low enough to cover everything from quick previews to high-definition output, making it well suited to creative design and short-video production. The accompanying paper is "PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation".
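The core of VTA is that each frame of the video latent gets its own diffusion timestep (and therefore its own noise level) instead of one scalar timestep shared by the whole clip, which is what lets a nearly clean conditioning frame coexist with fully noised frames in a single denoising pass. The snippet below is only a conceptual sketch of that idea under simplified assumptions (a linear noise schedule and made-up tensor shapes); it is not the actual Pusa or Wan2.1 implementation.

```python
import torch

# Conceptual sketch of vectorized timesteps: every frame gets its own
# diffusion timestep instead of one scalar shared by the whole clip.
# The linear noise schedule and tensor shapes here are simplified
# assumptions, not the actual Pusa / Wan2.1-T2V-14B implementation.

num_frames, channels, height, width = 8, 4, 32, 32
latents = torch.randn(num_frames, channels, height, width)  # clean video latents
noise = torch.randn_like(latents)

# Classic video diffusion: one scalar timestep for all frames.
t_scalar = torch.tensor(0.7)
noisy_scalar = (1 - t_scalar) * latents + t_scalar * noise

# Vectorized timesteps: a per-frame timestep vector. A conditioning frame
# (e.g. the input image in I2V) can be kept nearly clean (small t), while
# the frames to be generated remain fully noised (t close to 1).
t_vector = torch.tensor([0.2, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
t_vector = t_vector.view(num_frames, 1, 1, 1)  # broadcast over C, H, W
noisy_vector = (1 - t_vector) * latents + t_vector * noise

print(noisy_scalar.shape, noisy_vector.shape)  # both torch.Size([8, 4, 32, 32])
```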
This tutorial uses two RTX A6000 GPUs as its compute resource.
2. Project Examples
1. Image-to-Video

2. Multi-Frames to Video

3. Video-to-Video

4. Text-to-Video

3. Operation Steps
1. After starting the container, click the API address to enter the Web interface

2. Usage Steps
If "Bad Gateway" is displayed, it means the model is initializing. Since the model is large, please wait about 2-3 minutes and refresh the page.
2.1 Image-to-Video

Parameter Description
- Generation Parameters
- Noise Multiplier: Adjustable from 0.0 to 1.0, default 0.2 (lower values stay closer to the input image; higher values allow more creative motion).
- LoRA Alpha: Adjustable from 0.1 to 5.0, default 1.4 (controls style consistency; too high makes motion stiff, too low loses coherence).
- Inference Steps: Adjustable from 1 to 50, default 10 (more steps yield richer detail, but generation time grows roughly linearly). A programmatic call with these parameters is sketched below.
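The web form can also be driven programmatically with the official gradio_client package. The endpoint name /image_to_video and the argument order below are assumptions made for illustration; list the demo's real endpoints with client.view_api() before relying on them.

```python
from gradio_client import Client, handle_file

# Point the client at the demo's API address (the URL opened in the browser).
client = Client("http://<your-api-address>")

# Prints the available endpoints and their parameters; the endpoint name and
# argument order used below are assumptions, not a documented API.
client.view_api()

result = client.predict(
    handle_file("input.jpg"),       # conditioning image
    "a cat slowly turns its head",  # text prompt
    0.2,                            # Noise Multiplier (0.0-1.0, default 0.2)
    1.4,                            # LoRA Alpha (0.1-5.0, default 1.4)
    10,                             # Inference Steps (1-50, default 10)
    api_name="/image_to_video",     # assumed endpoint name
)
print(result)  # typically a path to the generated video file
```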
2.2 Multi-Frames to Video

Parameter Description
- Conditioning Parameters
- Conditioning Positions: Comma-separated frame indices (e.g., "0,20") defining which frames of the output act as keyframes.
- Noise Multipliers: Comma-separated values from 0.0 to 1.0 (e.g., "0.2,0.5"), one per conditioning position; lower values follow the corresponding keyframe more faithfully, higher values allow more variation (see the parsing sketch after this list).
- Generation Parameters
- LoRA Alpha: Adjustable from 0.1 to 5.0, default 1.4 (controls style consistency; too high makes motion stiff, too low loses coherence).
- Inference Steps: Adjustable from 1 to 50, default 10 (more steps yield richer detail, but generation time grows roughly linearly).
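The two comma-separated fields must stay aligned: the i-th noise multiplier applies to the i-th conditioning position. The helper below is a hypothetical utility (not part of the Pusa-VidGen codebase) showing how "0,20" and "0.2,0.5" would be interpreted and validated.

```python
# Hypothetical helper illustrating how the two comma-separated fields relate;
# it is not part of the Pusa-VidGen codebase.
def parse_conditioning(positions: str, multipliers: str) -> dict[int, float]:
    pos = [int(p) for p in positions.split(",") if p.strip()]
    mul = [float(m) for m in multipliers.split(",") if m.strip()]
    if len(pos) != len(mul):
        raise ValueError("Each conditioning position needs exactly one noise multiplier")
    for m in mul:
        if not 0.0 <= m <= 1.0:
            raise ValueError(f"Noise multiplier {m} is outside the 0.0-1.0 range")
    # The i-th multiplier controls how strictly the i-th keyframe is followed.
    return dict(zip(pos, mul))

print(parse_conditioning("0,20", "0.2,0.5"))
# {0: 0.2, 20: 0.5} -> frame 0 follows its keyframe closely, frame 20 is looser
```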
2.3 Video-to-Video

Parameter Description
- Conditioning Parameters
- Conditioning Positions: Comma-separated frame indices (e.g., "0,1,2,3") specifying which frames of the original video constrain generation; required.
- Noise Multipliers: Comma-separated values from 0.0 to 1.0 (e.g., "0.0,0.3"), one per conditioning frame; lower values stay closer to the original frame, higher values allow more change.
- Generation Parameters
- LoRA Alpha: Adjustable from 0.1 to 5.0, default 1.4 (controls style consistency; too high makes motion stiff, too low loses coherence).
- Inference Steps: Adjustable from 1 to 50, default 10 (more steps yield richer detail, but generation time grows roughly linearly). A programmatic call is sketched below.
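A programmatic Video-to-Video call mirrors the Image-to-Video sketch above, with the conditioning positions and noise multipliers passed as the same comma-separated strings used in the form. The endpoint name and argument order are again assumptions to confirm with client.view_api().

```python
from gradio_client import Client, handle_file

client = Client("http://<your-api-address>")

# Endpoint name and argument order are assumptions; confirm with client.view_api().
result = client.predict(
    handle_file("source.mp4"),   # original video supplying the conditioning frames
    "the same scene at sunset",  # text prompt
    "0,1,2,3",                   # Conditioning Positions: first four frames constrain generation
    "0.0,0.0,0.3,0.3",           # Noise Multipliers: 0.0 copies the frame, higher values allow more change
    1.4,                         # LoRA Alpha
    10,                          # Inference Steps
    api_name="/video_to_video",  # assumed endpoint name
)
print(result)
```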
2.4 Text-to-Video

Parameter Description
- Generation Parameters
- LoRA Alpha: Adjustable from 0.1 to 5.0, default 1.4 (controls style consistency; too high makes motion stiff, too low loses coherence).
- Inference Steps: Adjustable from 1 to 50, default 10 (more steps yield richer detail, but generation time grows roughly linearly). A programmatic call is sketched below.
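Text-to-Video needs only a prompt plus the two generation parameters. The sketch below assumes a Gradio endpoint named /text_to_video; check the real name and signature with client.view_api() first.

```python
from gradio_client import Client

client = Client("http://<your-api-address>")

# The endpoint name and argument order are assumptions; verify with client.view_api().
result = client.predict(
    "a paper boat drifting down a rainy street, cinematic lighting",  # prompt
    1.4,                        # LoRA Alpha (0.1-5.0, default 1.4)
    10,                         # Inference Steps (1-50, default 10)
    api_name="/text_to_video",  # assumed endpoint name
)
print(result)  # typically a path to the generated video
```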
4. Discussion
🖌️ If you come across a high-quality project, please leave us a message to recommend it! We have also set up a tutorial exchange group; scan the QR code and add the note [SD Tutorial] to join, discuss technical issues, and share your results↓

Citation Information
The citation information for this project is as follows:
@article{liu2025pusa,
  title={PUSA V1.0: Surpassing Wan-I2V with \$500 Training Cost by Vectorized Timestep Adaptation},
  author={Liu, Yaofang and Ren, Yumeng and Artola, Aitor and Hu, Yuxuan and Cun, Xiaodong and Zhao, Xiaotong and Zhao, Alan and Chan, Raymond H and Zhang, Suiyun and Liu, Rui and others},
  journal={arXiv preprint arXiv:2507.16116},
  year={2025}
}
@misc{Liu2025pusa,
  title={Pusa: Thousands Timesteps Video Diffusion Model},
  author={Yaofang Liu and Rui Liu},
  year={2025},
  url={https://github.com/Yaofang-Liu/Pusa-VidGen},
}
@article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-Michel},
  journal={arXiv preprint arXiv:2410.03160},
  year={2024}
}