
TurboDiffusion: Accelerating Video Diffusion Models by 100–200×

Jintao Zhang, Kaiwen Zheng, Kai Jiang, Haoxu Wang, Ion Stoica, Joseph E. Gonzalez, Jianfei Chen, Jun Zhu

Abstract

We present TurboDiffusion, a video generation acceleration framework that speeds up end-to-end diffusion generation by 100–200× while preserving video quality. TurboDiffusion relies mainly on several acceleration components: (1) Attention acceleration: TurboDiffusion uses low-bit SageAttention and trainable Sparse-Linear Attention (SLA) to speed up attention computation. (2) Step distillation: TurboDiffusion adopts the rCM method for efficient step distillation. (3) W8A8 quantization: TurboDiffusion quantizes model weights and activations to 8 bits to accelerate linear layers and reduce model size. In addition, TurboDiffusion incorporates several other software-level optimizations. We conducted experiments on the Wan2.2-I2V-14B-720P, Wan2.1-T2V-1.3B-480P, Wan2.1-T2V-14B-720P, and Wan2.1-T2V-14B-480P models. Experimental results show that TurboDiffusion achieves a 100–200× speedup in video generation, even on a single RTX 5090 GPU, while maintaining comparable video quality. The GitHub repository, including model checkpoints and easy-to-use code, is available at: https://github.com/thu-ml/TurboDiffusion

One-sentence Summary

Researchers from Tsinghua University, Shengshu Technology, and UC Berkeley propose TurboDiffusion, a framework accelerating video diffusion models 100–200× via SageAttention, SLA, rCM distillation, and W8A8 quantization, preserving quality on RTX 5090 GPUs for real-time applications.

Key Contributions

  • TurboDiffusion accelerates video diffusion models by 100–200× on a single RTX 5090 GPU through algorithmic innovations including low-bit SageAttention, trainable Sparse-Linear Attention, rCM-based step distillation, and W8A8 quantization, without compromising output quality.
  • The framework integrates attention sparsity, reduced sampling steps (e.g., from 100 to 3–4), and 8-bit quantization of weights and activations with block-wise granularity, enabling efficient inference while compressing model size by roughly half.
  • Evaluated across Wan2.1 and Wan2.2 video models (1.3B–14B parameters, 480P–720P resolution), TurboDiffusion cuts generation latency from minutes to seconds — e.g., from 4549s to 38s on Wan2.2-I2V-A14B-720P — while preserving visual fidelity compared to original and FastVideo baselines.

Introduction

The authors leverage a combination of algorithmic and systems-level optimizations to dramatically accelerate video diffusion models, enabling generation speeds 100 to 200 times faster while preserving visual quality. Prior work struggled with the computational intensity of video generation, often requiring minutes to hours per clip even on high-end hardware, limiting real-time or interactive applications. Their main contribution is TurboDiffusion, which integrates low-bit attention, sparse linear attention, step distillation via rCM, and W8A8 quantization alongside engineering refinements, reducing generation time to under a minute per video on a single RTX 5090 GPU across multiple model variants.

Method

The authors leverage a multi-pronged acceleration strategy in TurboDiffusion to achieve up to 200× speedup in video diffusion generation while preserving output fidelity. The framework integrates algorithmic innovations with system-level optimizations, targeting the most computationally intensive components of diffusion models: attention mechanisms, sampling steps, and linear transformations.

At the core of the attention acceleration is the adoption of SageAttention2++, a low-bit attention variant that exploits quantized computation for efficiency. This is further enhanced by Sparse-Linear Attention (SLA), which introduces sparsity patterns to reduce the quadratic complexity of self-attention. Since sparse computation and low-bit acceleration are orthogonal, SLA is implemented atop SageAttention to yield cumulative gains. During inference, the authors deploy SageSLA — a CUDA-optimized implementation of SLA built on SageAttention — to maximize hardware utilization on modern GPUs.
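To make the SLA idea concrete, here is a minimal sketch, not the authors' implementation: query/key blocks are pooled to rank key blocks, the top-scoring ("critical") blocks get exact softmax attention, and the remaining keys are handled by a cheap linear-attention pass. The additive combination, the `elu + 1` feature map, and all function names here are illustrative assumptions; the real SLA is trained/finetuned, and its CUDA kernel (SageSLA) runs the exact branch in low precision.

```python
import torch

def sparse_linear_attention(q, k, v, top_k_ratio=0.1, block=32):
    """Illustrative sketch of Sparse-Linear Attention (SLA).

    Critical key blocks get exact softmax attention; all remaining keys
    are approximated with linear attention (simplified, untrained combo).
    q, k, v: (seq_len, head_dim) tensors; seq_len must divide by `block`.
    """
    n, d = q.shape
    nb = n // block
    phi = lambda x: torch.nn.functional.elu(x) + 1  # positive feature map

    # Pool Q/K per block to cheaply rank key blocks for each query block.
    qb = q.view(nb, block, d).mean(1)
    kb = k.view(nb, block, d).mean(1)
    block_scores = qb @ kb.T / d ** 0.5             # (nb, nb)
    k_keep = max(1, int(top_k_ratio * nb))
    top = block_scores.topk(k_keep, dim=-1).indices  # critical blocks per row

    out = torch.empty_like(q)
    for i in range(nb):
        rows = slice(i * block, (i + 1) * block)
        crit = torch.cat([torch.arange(j * block, (j + 1) * block)
                          for j in top[i].tolist()])
        keep = torch.zeros(n, dtype=torch.bool)
        keep[crit] = True
        # Exact softmax attention on the critical key blocks ...
        att = torch.softmax(q[rows] @ k[crit].T / d ** 0.5, dim=-1)
        exact = att @ v[crit]
        # ... plus a linear-attention pass over the remaining keys,
        # which costs O(n * d^2) instead of O(n^2 * d).
        kn, vn = k[~keep], v[~keep]
        kv = phi(kn).T @ vn                          # (d, d)
        z = phi(kn).sum(0)                           # (d,)
        lin = (phi(q[rows]) @ kv) / \
              (phi(q[rows]) @ z).clamp_min(1e-6).unsqueeze(-1)
        out[rows] = exact + lin
    return out
```

The point of the sketch is the orthogonality noted above: the exact branch is a standard attention call that can be swapped for a low-bit SageAttention kernel, while the linear branch never materializes the n×n score matrix.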

Refer to the framework diagram illustrating the end-to-end acceleration pipeline, which highlights how attention, step reduction, and quantization modules interact during inference.

Step distillation is handled via rCM, a state-of-the-art method for reducing the number of sampling steps in diffusion models. The authors distill a pretrained model into a student model that requires only 3–4 sampling steps instead of the conventional 100, without compromising quality. This distillation is performed in parallel with SLA finetuning, and the resulting parameter updates are merged into a single unified model during training. The rCM approach naturally inherits attention-level optimizations, ensuring that speedups from sparse and low-bit attention carry over to the distilled model.
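The shape of such a few-step sampler can be sketched as follows. This is a generic consistency-style loop, not the rCM algorithm itself; the `student(x, t)` signature (predicting the clean sample from a noisy input at noise level `t`) and the linear re-noising schedule are assumptions for illustration.

```python
import torch

def few_step_sample(student, shape, steps=4):
    """Generic few-step sampling loop for a distilled student model.

    `student(x, t)` is assumed to map a noisy sample at noise level t
    (1.0 = pure noise, 0.0 = clean) to a prediction of the clean sample.
    """
    x = torch.randn(shape)                           # start from pure noise
    ts = torch.linspace(1.0, 0.0, steps + 1)[:-1]    # e.g. [1.0, 0.75, 0.5, 0.25]
    for i, t in enumerate(ts):
        x0 = student(x, t)                           # one network evaluation
        if i < steps - 1:
            t_next = ts[i + 1]
            # Re-noise the prediction down to the next noise level.
            x = (1 - t_next) * x0 + t_next * torch.randn_like(x0)
        else:
            x = x0
    return x
```

With `steps=4`, the diffusion backbone is evaluated only four times per video instead of ~100, which is where most of the end-to-end speedup comes from.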

For linear layer acceleration, TurboDiffusion employs W8A8 quantization — quantizing both weights and activations to INT8 with a block-wise granularity of 128×128. This reduces model size by approximately half and enables the use of INT8 Tensor Cores for accelerated matrix multiplication. During inference, activations are dynamically quantized on the fly, allowing full-precision training while benefiting from quantized inference throughput.
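Block-wise symmetric INT8 quantization can be sketched as below. The per-tile scale choice (absmax / 127) is a common convention, not necessarily the exact scheme used here; function names are hypothetical.

```python
import torch

def quantize_blockwise_int8(w, block=128):
    """Symmetric INT8 quantization with one scale per block x block tile.

    Returns (int8 tensor, float scales); dims must divide by `block`.
    Activations would be quantized the same way, dynamically, at inference.
    """
    rows, cols = w.shape
    q = torch.empty_like(w, dtype=torch.int8)
    scales = torch.empty(rows // block, cols // block)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = w[i:i + block, j:j + block]
            s = (tile.abs().max() / 127.0).clamp_min(1e-8)  # per-tile scale
            q[i:i + block, j:j + block] = torch.round(tile / s).to(torch.int8)
            scales[i // block, j // block] = s
    return q, scales

def dequantize_blockwise(q, scales, block=128):
    """Inverse mapping, used here only to check reconstruction error."""
    w = q.to(torch.float32)
    for i in range(scales.shape[0]):
        for j in range(scales.shape[1]):
            w[i * block:(i + 1) * block, j * block:(j + 1) * block] *= scales[i, j]
    return w
```

Each INT8 value occupies one byte instead of two (vs. FP16/BF16), which is the "roughly half" model-size reduction, and the INT8 tiles feed directly into Tensor Core GEMMs with the per-tile scales applied to the accumulator.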

Additional system-level optimizations include custom Triton and CUDA implementations of normalization layers such as LayerNorm and RMSNorm, which reduce kernel launch overhead and improve memory bandwidth utilization. Together, these optimizations cut generation latency on high-resolution video models from minutes to tens of seconds, as demonstrated on a single RTX 5090 GPU.
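For reference, the operation those fused kernels implement is small enough to state exactly; the sketch below is the standard RMSNorm math (the custom Triton/CUDA versions compute the same thing in a single fused kernel, avoiding the separate pow/mean/rsqrt/mul launches this naive version incurs).

```python
import torch

def rms_norm(x, weight, eps=1e-6):
    """Naive RMSNorm: normalize by the root-mean-square of the last dim,
    then apply a learned per-channel gain. No mean subtraction (unlike
    LayerNorm), which is what makes it cheap to fuse."""
    rms = x.pow(2).mean(-1, keepdim=True).add(eps).rsqrt()
    return x * rms * weight
```

Fusing this into one kernel matters because each of these elementwise/reduction ops is memory-bound: a fused kernel reads `x` once instead of several times.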

Experiment

  • TurboDiffusion significantly accelerates video generation across multiple Wan models, achieving speedups of up to 120x compared to original implementations while preserving visual quality.
  • It outperforms FastVideo in both efficiency and output fidelity, particularly on high-resolution and large-parameter models like Wan2.1-T2V-14B-720P.
  • The method maintains consistent performance across diverse prompts — from cinematic action and surreal art to documentary-style scenes — without compromising aesthetic or temporal coherence.
  • Optimal results are achieved with 3–4 sampling steps and a Top-K ratio of 0.1–0.15, balancing sparsity and quality.
  • Acceleration is effective across GPUs including RTX 5090, 4090, and H100, confirming hardware portability and broad applicability.
