
TurboDiffusion: Accelerating Video Diffusion Models by 100–200×

Jintao Zhang Kaiwen Zheng Kai Jiang Haoxu Wang Ion Stoica Joseph E. Gonzalez Jianfei Chen Jun Zhu

Abstract

We introduce TurboDiffusion, an acceleration framework for video generation that speeds up end-to-end diffusion generation by 100–200× without sacrificing video quality. TurboDiffusion relies mainly on several acceleration components: (1) Attention acceleration: TurboDiffusion uses low-bit SageAttention together with trainable Sparse-Linear Attention (SLA) to speed up attention computation. (2) Step distillation: TurboDiffusion employs rCM for efficient step distillation. (3) W8A8 quantization: TurboDiffusion quantizes model parameters and activations to 8 bits to accelerate linear layers and compress the model. In addition, TurboDiffusion integrates several further engineering optimizations. We conduct experiments on the Wan2.2-I2V-14B-720P, Wan2.1-T2V-1.3B-480P, Wan2.1-T2V-14B-720P, and Wan2.1-T2V-14B-480P models. The experimental results show that TurboDiffusion achieves a 100–200× speedup in video generation even on a single RTX 5090 GPU, while video quality remains comparable. The GitHub repository with model checkpoints and easy-to-use code is available at https://github.com/thu-ml/TurboDiffusion.

One-sentence Summary

Researchers from Tsinghua University, Shengshu Technology, and UC Berkeley propose TurboDiffusion, a framework accelerating video diffusion models 100–200× via SageAttention, SLA, rCM distillation, and W8A8 quantization, preserving quality on RTX 5090 GPUs for real-time applications.

Key Contributions

  • TurboDiffusion accelerates video diffusion models by 100–200× on a single RTX 5090 GPU through algorithmic innovations including low-bit SageAttention, trainable Sparse-Linear Attention, rCM-based step distillation, and W8A8 quantization, without compromising output quality.
  • The framework integrates attention sparsity, reduced sampling steps (e.g., from 100 to 3–4), and 8-bit quantization of weights and activations with block-wise granularity, enabling efficient inference while compressing model size by roughly half.
  • Evaluated across Wan2.1 and Wan2.2 video models (1.3B–14B parameters, 480P–720P resolution), TurboDiffusion cuts generation latency from minutes to seconds — e.g., from 4549s to 38s on Wan2.2-I2V-A14B-720P — while preserving visual fidelity compared to original and FastVideo baselines.

Introduction

The authors leverage a combination of algorithmic and systems-level optimizations to dramatically accelerate video diffusion models, enabling generation speeds 100 to 200 times faster while preserving visual quality. Prior work struggled with the computational intensity of video generation, often requiring minutes to hours per clip even on high-end hardware, limiting real-time or interactive applications. Their main contribution is TurboDiffusion, which integrates low-bit attention, sparse linear attention, step distillation via rCM, and W8A8 quantization alongside engineering refinements, reducing generation time to under a minute per video on a single RTX 5090 GPU across multiple model variants.

Top Figure

Method

The authors leverage a multi-pronged acceleration strategy in TurboDiffusion to achieve up to 200× speedup in video diffusion generation while preserving output fidelity. The framework integrates algorithmic innovations with system-level optimizations, targeting the most computationally intensive components of diffusion models: attention mechanisms, sampling steps, and linear transformations.

At the core of the attention acceleration is the adoption of SageAttention2++, a low-bit attention variant that exploits quantized computation for efficiency. This is further enhanced by Sparse-Linear Attention (SLA), which introduces sparsity patterns to reduce the quadratic complexity of self-attention. Since sparse computation and low-bit acceleration are orthogonal, SLA is implemented atop SageAttention to yield cumulative gains. During inference, the authors deploy SageSLA — a CUDA-optimized implementation of SLA built on SageAttention — to maximize hardware utilization on modern GPUs.
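The idea of combining a sparse exact branch with a cheap linear branch can be illustrated with a toy sketch. This is a minimal NumPy illustration, not the paper's SageSLA kernel: the block size, the block-scoring rule, and the equal mixing of the two branches are all simplifying assumptions.

```python
import numpy as np

def sparse_linear_attention(q, k, v, top_ratio=0.1, block=4):
    """Toy sketch of Sparse-Linear Attention (SLA): exact softmax attention
    on the highest-scoring key blocks, a cheap linear-attention
    approximation for the rest. Scoring and mixing are illustrative."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    num_blocks = n // block
    logits = q @ k.T * scale                               # (n, n) attention logits
    # Score each (query, key-block) pair by its mean logit.
    block_scores = logits.reshape(n, num_blocks, block).mean(-1)
    k_top = max(1, int(np.ceil(top_ratio * num_blocks)))
    top_idx = np.argsort(-block_scores, axis=1)[:, :k_top]

    # Linear-attention branch over all keys: O(n*d^2) instead of O(n^2*d).
    phi = lambda x: np.maximum(x, 0) + 1e-6                # positive feature map
    kv = phi(k).T @ v                                      # (d, d)
    z = phi(k).sum(0)                                      # (d,)

    out = np.zeros_like(v)
    for i in range(n):
        lin = phi(q[i]) @ kv / (phi(q[i]) @ z)
        # Exact softmax branch restricted to the selected key blocks.
        cols = np.concatenate(
            [np.arange(b * block, (b + 1) * block) for b in top_idx[i]])
        w = np.exp(logits[i, cols] - logits[i, cols].max())
        w /= w.sum()
        exact = w @ v[cols]
        out[i] = 0.5 * exact + 0.5 * lin                   # illustrative 50/50 mix
    return out
```

In the real system the exact branch additionally runs in low-bit precision via SageAttention, which is why the two speedups compose.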

Refer to the framework diagram illustrating the end-to-end acceleration pipeline, which highlights how attention, step reduction, and quantization modules interact during inference.

Step distillation is handled via rCM, a state-of-the-art method for reducing the number of sampling steps in diffusion models. The authors distill a pretrained model into a student model that requires only 3–4 sampling steps instead of the conventional 100, without compromising quality. This distillation is performed in parallel with SLA finetuning, and the resulting parameter updates are merged into a single unified model during training. The rCM approach naturally inherits attention-level optimizations, ensuring that speedups from sparse and low-bit attention carry over to the distilled model.
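The effect of step distillation on the sampling loop can be sketched as follows. This is a generic few-step consistency-style sampler under assumed interfaces, not the rCM implementation: `student(x, t)` returning a denoised estimate, the uniform time grid, and the deterministic re-noising rule are all illustrative choices.

```python
import numpy as np

def few_step_sample(student, x_T, steps=4):
    """Toy few-step sampler: a distilled student maps a noisy latent toward
    the clean sample in 3-4 large jumps instead of ~100 solver steps.
    `student(x, t)` predicting the clean latent is an assumed interface."""
    x = x_T
    # Time grid from t=1 (pure noise) down to t=0 (clean sample).
    ts = np.linspace(1.0, 0.0, steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x0_pred = student(x, t_cur)          # one-shot denoising estimate
        # Deterministically re-noise the estimate to the next noise level.
        x = x0_pred + t_next * (x - x0_pred) / t_cur
    return x
```

With 3–4 such jumps, the per-video cost of the diffusion backbone drops by roughly the step-count ratio, and each jump still benefits from the sparse/low-bit attention kernels above.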

For linear layer acceleration, TurboDiffusion employs W8A8 quantization, quantizing both weights and activations to INT8 with block-wise granularity of 128×128. This reduces model size by approximately half and enables the use of INT8 Tensor Cores for accelerated matrix multiplication. During inference, activations are dynamically quantized on-the-fly, allowing full-precision training while benefiting from quantized inference throughput.
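Block-wise symmetric INT8 quantization can be sketched as below. The 128×128 tile size follows the text; the symmetric absmax scaling and the assumption of tile-aligned shapes are illustrative simplifications of what a production kernel would do.

```python
import numpy as np

def quantize_int8_blockwise(x, block=128):
    """Sketch of block-wise symmetric INT8 quantization: each block x block
    tile gets its own scale, so outliers only affect their local tile.
    Assumes x's dimensions are multiples of `block` for simplicity."""
    h, w = x.shape
    q = np.empty((h, w), dtype=np.int8)
    scales = np.empty((h // block, w // block), dtype=np.float32)
    for i in range(0, h, block):
        for j in range(0, w, block):
            tile = x[i:i + block, j:j + block]
            s = np.abs(tile).max() / 127.0 + 1e-12         # per-tile scale
            scales[i // block, j // block] = s
            q[i:i + block, j:j + block] = np.clip(
                np.round(tile / s), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q, scales, block=128):
    """Inverse mapping: rescale each INT8 tile by its stored scale."""
    x = q.astype(np.float32)
    for i in range(scales.shape[0]):
        for j in range(scales.shape[1]):
            x[i * block:(i + 1) * block, j * block:(j + 1) * block] *= scales[i, j]
    return x
```

In the real pipeline the dequantization scale is folded into the INT8 matmul epilogue rather than materialized, so the Tensor Core GEMM runs entirely in INT8.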

Additional system-level optimizations include custom Triton and CUDA implementations of normalization layers such as LayerNorm and RMSNorm, which further reduce kernel launch overhead and improve memory bandwidth utilization. These optimizations collectively bring generation latency down from minutes to tens of seconds on high-resolution video models, as demonstrated on a single RTX 5090 GPU.
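For reference, the computation that such a fused RMSNorm kernel performs is simple; the gain comes from fusing the reduction and the scaling into one kernel launch rather than from the math itself. A plain NumPy version of the standard RMSNorm formula:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm: normalize by the root-mean-square of the last
    axis, then apply a learned per-channel gain. A fused GPU kernel
    computes exactly this in a single pass over the row."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight
```

A Triton or CUDA kernel computes the row reduction and the elementwise rescale in one pass, avoiding an extra round trip through GPU memory.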

Experiment

  • TurboDiffusion significantly accelerates video generation across multiple Wan models, achieving speedups of up to 120x compared to original implementations while preserving visual quality.
  • It outperforms FastVideo in both efficiency and output fidelity, particularly on high-resolution and large-parameter models like Wan2.1-T2V-14B-720P.
  • The method maintains consistent performance across diverse prompts — from cinematic action and surreal art to documentary-style scenes — without compromising aesthetic or temporal coherence.
  • Optimal results are achieved with 3–4 sampling steps and a Top-K ratio of 0.1–0.15, balancing sparsity and quality.
  • Acceleration is effective across GPUs including RTX 5090, 4090, and H100, confirming hardware portability and broad applicability.
