TurboDiffusion: Accelerating Video Diffusion Models by 100–200 Times
Jintao Zhang Kaiwen Zheng Kai Jiang Haoxu Wang Ion Stoica Joseph E. Gonzalez Jianfei Chen Jun Zhu
TurboDiffusion: An Image- and Text-Based Video Generation System
Abstract
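We introduce TurboDiffusion, an acceleration framework for text-based video generation that speeds up video generation by 100–200× while maintaining video quality. TurboDiffusion relies mainly on the following core components for acceleration: (1) Attention acceleration: TurboDiffusion uses low-bit SageAttention and trainable Sparse-Linear Attention (SLA) to speed up attention computation. (2) Step distillation: TurboDiffusion adopts rCM for efficient step distillation. (3) W8A8 quantization: TurboDiffusion quantizes model parameters and activations to 8 bits, which accelerates the linear layers and reduces model size. In addition, TurboDiffusion incorporates several engineering optimizations to further improve performance. We conduct experiments on the Wan2.2-I2V-14B-720P, Wan2.1-T2V-1.3B-480P, Wan2.1-T2V-14B-720P, and Wan2.1-T2V-14B-480P models. The results show that, even on a single RTX 5090 GPU, TurboDiffusion achieves a 100–200× speedup in video generation while maintaining video quality comparable to the original models. The GitHub repository, which includes model checkpoints and easy-to-use code, is available at https://github.com/thu-ml/TurboDiffusion.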
One-sentence Summary
Researchers from Tsinghua University, Shengshu Technology, and UC Berkeley propose TurboDiffusion, a framework accelerating video diffusion models 100–200× via SageAttention, SLA, rCM distillation, and W8A8 quantization, preserving quality on RTX 5090 GPUs for real-time applications.
Key Contributions
- TurboDiffusion accelerates video diffusion models by 100–200× on a single RTX 5090 GPU through algorithmic innovations including low-bit SageAttention, trainable Sparse-Linear Attention, rCM-based step distillation, and W8A8 quantization, without compromising output quality.
- The framework integrates attention sparsity, reduced sampling steps (e.g., from 100 to 3–4), and 8-bit quantization of weights and activations with block-wise granularity, enabling efficient inference while compressing model size by roughly half.
- Evaluated across Wan2.1 and Wan2.2 video models (1.3B–14B parameters, 480P–720P resolution), TurboDiffusion cuts generation latency from minutes to seconds — e.g., from 4549s to 38s on Wan2.2-I2V-A14B-720P — while preserving visual fidelity compared to original and FastVideo baselines.
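The headline figures above can be checked with a few lines of arithmetic. The sketch below only restates numbers quoted in this summary (4549s to 38s, and 100 sampling steps reduced to 3–4); the helper name `speedup` is hypothetical.

```python
def speedup(baseline_seconds, accelerated_seconds):
    """Ratio of baseline latency to accelerated latency."""
    return baseline_seconds / accelerated_seconds

# Latency quoted for Wan2.2-I2V-A14B-720P: 4549s before, 38s after.
end_to_end = speedup(4549, 38)
print(f"end-to-end speedup: {end_to_end:.1f}x")  # prints 119.7x

# Step distillation alone reduces the number of network evaluations.
for steps in (3, 4):
    print(f"100 -> {steps} steps: {100 / steps:.1f}x fewer evaluations")
```

The step reduction accounts for a 25–33× drop in network evaluations on its own; the rest of the end-to-end gain has to come from the attention, quantization, and kernel-level optimizations.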
Introduction
The authors leverage a combination of algorithmic and systems-level optimizations to dramatically accelerate video diffusion models, enabling generation speeds 100 to 200 times faster while preserving visual quality. Prior work struggled with the computational intensity of video generation, often requiring minutes to hours per clip even on high-end hardware, limiting real-time or interactive applications. Their main contribution is TurboDiffusion, which integrates low-bit attention, sparse linear attention, step distillation via rCM, and W8A8 quantization alongside engineering refinements, reducing generation time to under a minute per video on a single RTX 5090 GPU across multiple model variants.

Method
The authors leverage a multi-pronged acceleration strategy in TurboDiffusion to achieve up to 200× speedup in video diffusion generation while preserving output fidelity. The framework integrates algorithmic innovations with system-level optimizations, targeting the most computationally intensive components of diffusion models: attention mechanisms, sampling steps, and linear transformations.
At the core of the attention acceleration is the adoption of SageAttention2++, a low-bit attention variant that exploits quantized computation for efficiency. This is further enhanced by Sparse-Linear Attention (SLA), which introduces sparsity patterns to reduce the quadratic complexity of self-attention. Since sparse computation and low-bit acceleration are orthogonal, SLA is implemented atop SageAttention to yield cumulative gains. During inference, the authors deploy SageSLA — a CUDA-optimized implementation of SLA built on SageAttention — to maximize hardware utilization on modern GPUs.
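To make the sparsity idea concrete, here is a minimal pure-Python sketch of block-sparse attention with Top-K block selection. It is not the SageSLA kernel: it omits low-bit quantization and the linear-attention branch of SLA, and the mean-pooling selection rule, the function names, and the `block` and `topk_ratio` parameters are assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_pool(rows, block):
    """Average each run of `block` consecutive rows into one vector."""
    pooled = []
    for start in range(0, len(rows), block):
        chunk = rows[start:start + block]
        dim = len(chunk[0])
        pooled.append([sum(r[d] for r in chunk) / len(chunk) for d in range(dim)])
    return pooled

def select_key_blocks(q, k, block, topk_ratio):
    """For each query block, keep the key blocks with the highest pooled score."""
    q_pool, k_pool = mean_pool(q, block), mean_pool(k, block)
    keep = max(1, math.ceil(topk_ratio * len(k_pool)))
    selected = []
    for qb in q_pool:
        scores = [dot(qb, kb) for kb in k_pool]
        ranked = sorted(range(len(k_pool)), key=lambda j: scores[j], reverse=True)
        selected.append(sorted(ranked[:keep]))
    return selected

def block_sparse_attention(q, k, v, block, topk_ratio):
    """Softmax attention restricted to the selected key blocks."""
    selected = select_key_blocks(q, k, block, topk_ratio)
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        # Token indices covered by the key blocks kept for this query's block.
        cols = [j for b in selected[i // block]
                for j in range(b * block, min((b + 1) * block, len(k)))]
        logits = [dot(qi, k[j]) * scale for j in cols]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        out.append([sum(w * v[j][d] for w, j in zip(weights, cols)) / total
                    for d in range(len(v[0]))])
    return out
```

With `topk_ratio=1.0` every key block is kept and the function reduces to ordinary dense attention; smaller ratios skip whole blocks of the score matrix, which is where the compute savings come from.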
Refer to the framework diagram illustrating the end-to-end acceleration pipeline, which highlights how attention, step reduction, and quantization modules interact during inference.
Step distillation is handled via rCM, a state-of-the-art method for reducing the number of sampling steps in diffusion models. The authors distill a pretrained model into a student model that requires only 3–4 sampling steps instead of the conventional 100, without compromising quality. This distillation is performed in parallel with SLA finetuning, and the resulting parameter updates are merged into a single unified model during training. The rCM approach naturally inherits attention-level optimizations, ensuring that speedups from sparse and low-bit attention carry over to the distilled model.
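The effect of step distillation at inference time can be sketched as a generic few-step sampler, in which each step is a single call to the distilled student. This is not the rCM training procedure, and it may not match the sampler in the released code: the flow-style re-noising rule `x_t = (1 - t) * x0 + t * z`, the function name `few_step_sample`, and the `student(x, t)` signature are all assumptions for illustration.

```python
import random

def few_step_sample(student, timesteps, dim, seed=0):
    """Run one student call per timestep; return the sample and the call count."""
    rng = random.Random(seed)
    # Start from pure Gaussian noise (assumed to correspond to t = 1).
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    calls = 0
    for i, t in enumerate(timesteps):
        x0 = student(x, t)  # the student predicts a clean sample directly
        calls += 1
        if i + 1 == len(timesteps):
            return x0, calls
        # Re-noise the clean estimate to the next, lower noise level.
        t_next = timesteps[i + 1]
        x = [(1.0 - t_next) * c + t_next * rng.gauss(0.0, 1.0) for c in x0]
    return x, calls

# Toy student that always predicts the same clean sample (illustration only).
toy_student = lambda x, t: [2.0 for _ in x]
sample, calls = few_step_sample(toy_student, [1.0, 0.75, 0.5, 0.25], dim=3)
print(calls)  # 4 network evaluations instead of 100
```

The cost of sampling is dominated by the number of student calls, so a four-entry schedule costs four forward passes regardless of how the schedule values are chosen.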
For linear layer acceleration, TurboDiffusion employs W8A8 quantization — quantizing both weights and activations to INT8 with block-wise granularity of 128×128. This reduces model size by approximately half and enables the use of INT8 Tensor Cores for accelerated matrix multiplication. During inference, activations are dynamically quantized on-the-fly, allowing full precision training while benefiting from quantized inference throughput.
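Block-wise quantization can be illustrated with a small reference routine. The sketch below assumes symmetric quantization with one scale per block (the block's maximum absolute value mapped to 127); the source states only the INT8 format and the 128×128 granularity, so the scaling rule and the function names are assumptions, and a real implementation would run as a GPU kernel rather than Python loops.

```python
def quantize_blockwise(matrix, block=128):
    """Symmetric INT8 quantization with one scale per block x block tile."""
    rows, cols = len(matrix), len(matrix[0])
    q = [[0] * cols for _ in range(rows)]
    scales = {}
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            r1, c1 = min(r0 + block, rows), min(c0 + block, cols)
            amax = max(abs(matrix[r][c])
                       for r in range(r0, r1) for c in range(c0, c1))
            scale = amax / 127.0 if amax > 0 else 1.0
            scales[(r0 // block, c0 // block)] = scale
            for r in range(r0, r1):
                for c in range(c0, c1):
                    # Round to the nearest integer and clamp to the INT8 range.
                    q[r][c] = max(-127, min(127, round(matrix[r][c] / scale)))
    return q, scales

def dequantize_blockwise(q, scales, block=128):
    """Map INT8 codes back to floats using each tile's scale."""
    return [[q[r][c] * scales[(r // block, c // block)]
             for c in range(len(q[0]))] for r in range(len(q))]

# Tiny demo with 2x2 tiles so the per-block scales are easy to inspect.
w = [[0.5, -1.0, 100.0, 50.0],
     [0.25, 0.75, -25.0, 10.0]]
codes, scales = quantize_blockwise(w, block=2)
restored = dequantize_blockwise(codes, scales, block=2)
```

Keeping a separate scale per tile means the large values in the right-hand tile do not wipe out the precision of the small values in the left-hand tile, which is the usual motivation for block-wise rather than per-tensor scaling.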
Additional system-level optimizations include custom Triton and CUDA implementations of normalization layers such as LayerNorm and RMSNorm, which reduce kernel launch overhead and improve memory bandwidth utilization. Together, these optimizations bring generation latency on high-resolution video models down from minutes to seconds on a single RTX 5090 GPU (for example, 38s on Wan2.2-I2V-A14B-720P).
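For reference, the computation that a fused RMSNorm kernel performs is small enough to state directly. The following is a plain-Python reference of the standard RMSNorm definition, not the Triton or CUDA code from the repository; the default `eps` value is an assumption.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm: divide x by its root mean square, then scale by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

y = rms_norm([3.0, 4.0], [1.0, 1.0])
```

A fused kernel computes the reduction and the scaling in one pass over the data, so the input is read from memory once instead of once per elementwise operation.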
Experiment
- TurboDiffusion significantly accelerates video generation across multiple Wan models, achieving speedups of roughly 100–200× over the original implementations (about 120× on Wan2.2-I2V-A14B-720P, from 4549s to 38s) while preserving visual quality.
- It outperforms FastVideo in both efficiency and output fidelity, particularly on high-resolution and large-parameter models like Wan2.1-T2V-14B-720P.
- The method maintains consistent performance across diverse prompts — from cinematic action and surreal art to documentary-style scenes — without compromising aesthetic or temporal coherence.
- Optimal results are achieved with 3–4 sampling steps and a Top-K ratio of 0.1–0.15, balancing sparsity and quality.
- Acceleration is effective across GPUs including RTX 5090, 4090, and H100, confirming hardware portability and broad applicability.