
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention


Abstract

In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.
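To make the decomposition described above concrete, the sketch below is a minimal, non-accelerated PyTorch reference that classifies softmax attention weights into critical, marginal, and negligible groups per query and routes them through an exact path and a linear-attention path. It is illustrative only and not the paper's method: the threshold fractions, the elu+1 feature map for the linear branch, the recombination rule, and the function name are all assumptions, and the sketch materializes the full N x N weight matrix, so it offers none of the speedup of the fused GPU kernel reported in the abstract.

```python
import torch
import torch.nn.functional as F

def sparse_linear_attention_reference(q, k, v,
                                      critical_frac=0.05,
                                      negligible_frac=0.50):
    """Illustrative reference only. q, k, v: (batch, heads, seq_len, head_dim).
    The two fractions are hypothetical knobs, not values from the paper."""
    n, d = q.shape[-2], q.shape[-1]
    scores = torch.einsum("bhid,bhjd->bhij", q, k) * d ** -0.5
    probs = scores.softmax(dim=-1)  # full N x N weights, used here only to classify

    # Per query row, split weights into critical (largest few), negligible
    # (smallest portion), and marginal (everything in between).
    sorted_probs, _ = probs.sort(dim=-1, descending=True)
    crit_cut = sorted_probs[..., max(int(critical_frac * n) - 1, 0)].unsqueeze(-1)
    negl_cut = sorted_probs[..., max(int((1.0 - negligible_frac) * n) - 1, 0)].unsqueeze(-1)
    critical = (probs >= crit_cut).float()
    negligible = (probs < negl_cut).float()
    marginal = 1.0 - critical - negligible

    # Critical weights: exact attention restricted to the large entries
    # (the O(N^2) part, which a real kernel would compute sparsely).
    out_critical = torch.einsum("bhij,bhjd->bhid", probs * critical, v)

    # Marginal weights: replaced by a linear-attention style O(N) approximation
    # (elu + 1 feature map); negligible weights are simply skipped.
    phi_q, phi_k = F.elu(q) + 1.0, F.elu(k) + 1.0
    kv = torch.einsum("bhjd,bhje->bhde", phi_k, v)
    denom = torch.einsum("bhid,bhd->bhi", phi_q, phi_k.sum(dim=2)).clamp(min=1e-6)
    out_linear = torch.einsum("bhid,bhde->bhie", phi_q, kv) / denom.unsqueeze(-1)

    # Scale the linear branch by the marginal probability mass so the two
    # branches compose into one output (an illustrative recombination rule).
    marginal_mass = (probs * marginal).sum(dim=-1, keepdim=True)
    return out_critical + marginal_mass * out_linear

# Example usage with random tensors of shape (batch=1, heads=8, seq_len=256, dim=64).
q = k = v = torch.randn(1, 8, 256, 64)
out = sparse_linear_attention_reference(q, k, v)  # -> (1, 8, 256, 64)
```

Per the abstract, the actual method differs from this sketch in that the sparse and linear computations are fused into a single GPU kernel supporting both forward and backward passes, and a few fine-tuning steps adapt the DiT model to the new attention pattern so that the roughly 20x reduction in attention computation does not degrade generation quality.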
