
TurboDiffusion: Accelerating Video Diffusion Models by 100–200×

Jintao Zhang Kaiwen Zheng Kai Jiang Haoxu Wang Ion Stoica Joseph E. Gonzalez Jianfei Chen Jun Zhu

Abstract

This paper introduces TurboDiffusion, an acceleration framework that speeds up video generation by 100–200× while maintaining video quality. TurboDiffusion achieves this through the following main components: (1) Attention acceleration: low-bit SageAttention and trainable Sparse-Linear Attention (SLA) are used to speed up attention computation. (2) Step distillation: rCM is adopted for efficient step distillation. (3) W8A8 quantization: model parameters and activations are quantized to 8 bits, accelerating the linear layers and compressing the model. Several additional engineering optimizations are also integrated. Experiments are conducted on the Wan2.2-I2V-14B-720P, Wan2.1-T2V-1.3B-480P, Wan2.1-T2V-14B-720P, and Wan2.1-T2V-14B-480P models. The results show that TurboDiffusion accelerates video generation by 100–200× even on a single RTX 5090 GPU while maintaining relatively high video quality. A GitHub repository with model checkpoints and easy-to-use code is available at https://github.com/thu-ml/TurboDiffusion.

One-sentence Summary

Researchers from Tsinghua University, Shengshu Technology, and UC Berkeley propose TurboDiffusion, a framework accelerating video diffusion models 100–200× via SageAttention, SLA, rCM distillation, and W8A8 quantization, preserving quality on RTX 5090 GPUs for real-time applications.

Key Contributions

  • TurboDiffusion accelerates video diffusion models by 100–200× on a single RTX 5090 GPU through algorithmic innovations including low-bit SageAttention, trainable Sparse-Linear Attention, rCM-based step distillation, and W8A8 quantization, without compromising output quality.
  • The framework integrates attention sparsity, reduced sampling steps (e.g., from 100 to 3–4), and 8-bit quantization of weights and activations with block-wise granularity, enabling efficient inference while compressing model size by roughly half.
  • Evaluated across Wan2.1 and Wan2.2 video models (1.3B–14B parameters, 480P–720P resolution), TurboDiffusion cuts generation latency from minutes to seconds — e.g., from 4549s to 38s on Wan2.2-I2V-A14B-720P — while preserving visual fidelity compared to original and FastVideo baselines.

Introduction

The authors leverage a combination of algorithmic and systems-level optimizations to dramatically accelerate video diffusion models, enabling generation speeds 100 to 200 times faster while preserving visual quality. Prior work struggled with the computational intensity of video generation, often requiring minutes to hours per clip even on high-end hardware, limiting real-time or interactive applications. Their main contribution is TurboDiffusion, which integrates low-bit attention, sparse linear attention, step distillation via rCM, and W8A8 quantization alongside engineering refinements, reducing generation time to under a minute per video on a single RTX 5090 GPU across multiple model variants.

Method

The authors leverage a multi-pronged acceleration strategy in TurboDiffusion to achieve up to 200× speedup in video diffusion generation while preserving output fidelity. The framework integrates algorithmic innovations with system-level optimizations, targeting the most computationally intensive components of diffusion models: attention mechanisms, sampling steps, and linear transformations.

At the core of the attention acceleration is the adoption of SageAttention2++, a low-bit attention variant that exploits quantized computation for efficiency. This is further enhanced by Sparse-Linear Attention (SLA), which introduces sparsity patterns to reduce the quadratic complexity of self-attention. Since sparse computation and low-bit acceleration are orthogonal, SLA is implemented atop SageAttention to yield cumulative gains. During inference, the authors deploy SageSLA — a CUDA-optimized implementation of SLA built on SageAttention — to maximize hardware utilization on modern GPUs.
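The block-sparse selection underlying SLA-style attention can be illustrated with a small sketch. This is a hypothetical NumPy mock-up of Top-K block selection (the function name `topk_block_mask`, the mean-pooling of blocks, and the exact scoring are assumptions for illustration, not the paper's implementation): queries and keys are pooled into blocks, block pairs are scored, and only the top-scoring fraction of key blocks per query block is computed exactly.

```python
import numpy as np

def topk_block_mask(q, k, block=64, keep_ratio=0.1):
    """Hypothetical sketch of Top-K block selection for sparse attention.

    Mean-pools query/key blocks, scores block pairs, and keeps only the
    highest-scoring `keep_ratio` fraction of key blocks per query block.
    """
    nq, nk = q.shape[0] // block, k.shape[0] // block
    qb = q[: nq * block].reshape(nq, block, -1).mean(axis=1)  # pooled query blocks
    kb = k[: nk * block].reshape(nk, block, -1).mean(axis=1)  # pooled key blocks
    scores = qb @ kb.T                                        # block-level similarity
    keep = max(1, int(keep_ratio * nk))
    top = np.argsort(scores, axis=1)[:, -keep:]               # indices of kept key blocks
    mask = np.zeros((nq, nk), dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return mask  # True = compute this block pair exactly

q = np.random.randn(512, 64).astype(np.float32)
k = np.random.randn(512, 64).astype(np.float32)
mask = topk_block_mask(q, k, block=64, keep_ratio=0.125)
print(mask.shape, mask.sum(axis=1))  # (8, 8); each query block keeps 1 key block
```

In the real system the unmasked block pairs would be dispatched to the low-bit SageAttention kernel, which is why the two optimizations compose.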

Refer to the framework diagram illustrating the end-to-end acceleration pipeline, which highlights how attention, step reduction, and quantization modules interact during inference.

Step distillation is handled via rCM, a state-of-the-art method for reducing the number of sampling steps in diffusion models. The authors distill a pretrained model into a student model that requires only 3–4 sampling steps instead of the conventional 100, without compromising quality. This distillation is performed in parallel with SLA finetuning, and the resulting parameter updates are merged into a single unified model during training. The rCM approach naturally inherits attention-level optimizations, ensuring that speedups from sparse and low-bit attention carry over to the distilled model.
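To make the step-count reduction concrete, here is an illustrative few-step consistency-style sampling loop. This is not the rCM algorithm itself (the sigma schedule and re-noising rule below are generic assumptions): a distilled student maps a noisy sample directly to a clean estimate, and re-noising to the next smaller noise level yields a 3–4 step trajectory in place of ~100 solver steps.

```python
import numpy as np

def few_step_sample(model, noise, sigmas=(80.0, 24.0, 5.8, 0.0)):
    """Illustrative few-step consistency-style sampler (not rCM itself).

    The student `model` predicts a clean sample in one shot; each
    iteration re-noises that estimate to the next, smaller sigma.
    """
    x = noise * sigmas[0]
    for sigma_next in sigmas[1:]:
        x0 = model(x)  # student's one-shot clean estimate
        if sigma_next > 0:
            x = x0 + sigma_next * np.random.randn(*x.shape)  # re-noise for next step
        else:
            x = x0
    return x

# Toy stand-in "model": maps anything to zeros, so the output is the clean fixed point.
toy = lambda x: 0.0 * x
out = few_step_sample(toy, np.random.randn(4, 4))
print(out.shape)
```

With 4 sigmas the loop runs only 3 denoiser calls, which is the source of the latency reduction from step distillation.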

For linear layer acceleration, TurboDiffusion employs W8A8 quantization, quantizing both weights and activations to INT8 with a block-wise granularity of 128×128. This reduces model size by approximately half and enables the use of INT8 Tensor Cores for accelerated matrix multiplication. During inference, activations are dynamically quantized on the fly, so training remains in full precision while inference benefits from quantized throughput.
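The block-wise weight quantization can be sketched as follows. This is a minimal NumPy reference (symmetric per-block scaling is assumed; the actual kernels would run the matmul on INT8 Tensor Cores rather than dequantizing): each 128×128 block gets its own scale, values are rounded to INT8, and dequantization multiplies back by the per-block scale.

```python
import numpy as np

def quantize_blockwise_int8(w, block=128):
    """Sketch of symmetric block-wise INT8 weight quantization (128x128 blocks)."""
    h, wd = w.shape
    qs = np.zeros_like(w, dtype=np.int8)
    scales = {}
    for i in range(0, h, block):
        for j in range(0, wd, block):
            blk = w[i:i + block, j:j + block]
            s = max(np.abs(blk).max() / 127.0, 1e-8)  # per-block scale
            scales[(i, j)] = s
            qs[i:i + block, j:j + block] = np.clip(
                np.round(blk / s), -127, 127).astype(np.int8)
    return qs, scales

def dequantize(qs, scales, block=128):
    w = qs.astype(np.float32)
    for (i, j), s in scales.items():
        w[i:i + block, j:j + block] *= s
    return w

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_blockwise_int8(w)
err = np.abs(dequantize(q, s) - w).max()  # worst-case rounding error, about scale/2
print(q.dtype, err)
```

Per-block scales keep the rounding error proportional to each block's own dynamic range, which is why block-wise granularity loses less accuracy than a single per-tensor scale.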

Additional system-level optimizations include custom Triton and CUDA implementations of normalization layers such as LayerNorm and RMSNorm, which further reduce kernel launch overhead and improve memory bandwidth utilization. These optimizations collectively bring generation latency down from minutes to tens of seconds on high-resolution video models, as demonstrated on a single RTX 5090 GPU.
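For reference, the math that those fused normalization kernels compute is simple; the gain comes from doing the reduction and scaling in one kernel instead of several memory round-trips. A plain NumPy version of RMSNorm (the epsilon value is a common default, assumed here):

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    """Reference RMSNorm: normalize by the root-mean-square of the last axis,
    then apply a learned per-channel scale. A fused kernel computes the
    reduction and the scaling in a single pass over memory."""
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x / rms * weight

x = np.ones((2, 4), dtype=np.float32)
y = rmsnorm(x, np.ones(4, dtype=np.float32))
print(y)  # all entries ~1.0, since the RMS of an all-ones row is 1
```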

Experiment

  • TurboDiffusion significantly accelerates video generation across multiple Wan models, achieving speedups of up to 120x compared to original implementations while preserving visual quality.
  • It outperforms FastVideo in both efficiency and output fidelity, particularly on high-resolution and large-parameter models like Wan2.1-T2V-14B-720P.
  • The method maintains consistent performance across diverse prompts — from cinematic action and surreal art to documentary-style scenes — without compromising aesthetic or temporal coherence.
  • Optimal results are achieved with 3–4 sampling steps and a Top-K ratio of 0.1–0.15, balancing sparsity and quality.
  • Acceleration is effective across GPUs including RTX 5090, 4090, and H100, confirming hardware portability and broad applicability.
