Self-Forcing Real-Time Video Generation
1. Tutorial Introduction

Self-Forcing was proposed by Xun Huang's team on June 9, 2025. It is a new training paradigm for autoregressive video diffusion models that addresses the long-standing problem of exposure bias: models trained on ground-truth context must, at inference time, generate sequences conditioned on their own imperfect outputs. Unlike previous methods that denoise future frames conditioned on ground-truth context frames, Self-Forcing conditions each frame's generation on the model's own previously generated output by performing an autoregressive rollout with a key-value (KV) cache during training. Supervision comes from a holistic, video-level loss that directly evaluates the quality of the entire generated sequence, rather than relying solely on traditional frame-by-frame objectives. To keep training efficient, the method uses a few-step diffusion model together with a stochastic gradient-truncation strategy, balancing computational cost and performance. A rolling KV cache mechanism is further introduced for efficient autoregressive video extrapolation. Extensive experiments show that the method achieves real-time streaming video generation with sub-second latency on a single GPU, while matching or even exceeding the generation quality of significantly slower, non-causal diffusion models. The related paper is "Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion".
This tutorial runs on a single RTX 4090 GPU.
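
To make the training idea above more concrete, here is a minimal, self-contained sketch: the model is rolled out autoregressively on its own outputs (a stand-in for the KV-cached rollout), most chunks are detached to keep memory bounded (stochastic gradient truncation), and a single video-level loss is applied to the whole rollout. All module and function names (`TinyFrameDenoiser`, `self_forcing_rollout`, `holistic_video_loss`) are illustrative assumptions, not the authors' implementation, and the loss is a simple placeholder rather than the distribution-matching objective used in the paper.

```python
import torch
import torch.nn as nn

class TinyFrameDenoiser(nn.Module):
    """Stand-in for a few-step autoregressive video diffusion model."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 2, dim), nn.ReLU(), nn.Linear(dim, dim))

    def denoise_chunk(self, noise, context):
        # "context" summarizes previously generated frames (a stand-in for a KV cache).
        return self.net(torch.cat([noise, context], dim=-1))

def self_forcing_rollout(model, num_chunks=8, dim=64, device="cpu"):
    context = torch.zeros(1, dim, device=device)          # empty cache at the start
    grad_chunk = torch.randint(num_chunks, (1,)).item()   # the one chunk that keeps gradients
    frames = []
    for t in range(num_chunks):
        noise = torch.randn(1, dim, device=device)
        frame = model.denoise_chunk(noise, context)
        if t != grad_chunk:
            frame = frame.detach()                         # stochastic gradient truncation
        frames.append(frame)
        # Condition the next step on the model's OWN output, not ground truth.
        context = frame
    return torch.stack(frames, dim=1)                      # (batch, chunks, dim)

def holistic_video_loss(generated, target):
    # Placeholder for a sequence-level objective over the entire rollout.
    return ((generated - target) ** 2).mean()

model = TinyFrameDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
video = self_forcing_rollout(model)
loss = holistic_video_loss(video, torch.zeros_like(video))
loss.backward()
opt.step()
```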
2. Project Examples

3. Operation Steps
1. After starting the container, click the API address to enter the Web interface

2. Usage Steps

Parameter Description
- Advanced Settings (see the sketch after this list):
  - Seed: Random seed that controls the randomness of the generation process. A fixed seed reproduces the same result; -1 means a random seed is drawn each run.
  - Target FPS: Target frame rate. The default is 6, meaning the generated video plays at 6 frames per second.
  - torch.compile: Enables PyTorch compilation to accelerate model inference (requires environment support).
  - FP8 Quantization: Enables 8-bit floating-point quantization, lowering computational precision to speed up generation (may slightly reduce quality).
  - TAEHV VAE: Switches to the lightweight TAEHV variational autoencoder (VAE) for decoding, which may affect fine detail or style in the output.
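
The sketch below shows how these advanced settings typically map onto a generation call. The `run_generation` function, the `pipeline` object, and its keyword arguments are hypothetical placeholders for whatever the demo actually exposes; only the `torch` APIs (`torch.seed`, `torch.manual_seed`, `torch.compile`) are real.

```python
import torch

def run_generation(pipeline, prompt, seed=-1, target_fps=6,
                   use_compile=False, use_fp8=False, use_taehv=False):
    # Seed: -1 means "draw a fresh random seed"; any other value makes runs reproducible.
    if seed == -1:
        seed = torch.seed()            # draws and sets a non-deterministic seed
    else:
        torch.manual_seed(seed)

    # torch.compile: optional graph compilation for faster inference
    # (requires PyTorch 2.x and a supported GPU/compiler stack).
    if use_compile:
        pipeline.model = torch.compile(pipeline.model)   # hypothetical attribute

    # FP8 quantization and the TAEHV VAE are speed/quality trade-offs toggled in the UI;
    # how they are applied depends on the project's own code, so they are only passed through.
    return pipeline(prompt, fps=target_fps, fp8=use_fp8, taehv_vae=use_taehv)
```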
4. Discussion
🖌️ If you come across a high-quality project, please leave us a message to recommend it! We have also set up a tutorial exchange group: scan the QR code below and note [SD Tutorial] to join, discuss technical issues, and share results ↓

Citation Information
The citation information for this project is as follows:
@article{huang2025selfforcing,
  title={Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion},
  author={Huang, Xun and Li, Zhengqi and He, Guande and Zhou, Mingyuan and Shechtman, Eli},
  journal={arXiv preprint arXiv:2506.08009},
  year={2025}
}