SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention

Abstract
In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.
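To make the critical/marginal/negligible split concrete, the sketch below illustrates the general idea in plain PyTorch: block-level importance scores route block pairs either to exact O(N^2) attention, to a kernel-feature linear-attention approximation, or to being skipped. This is only an illustrative sketch, not the paper's fused GPU kernel; the block size, quantile thresholds, elu-based feature map, and the simple additive combination of the two branches are all assumptions made for this demo.

```python
# Illustrative sketch of a sparse-linear attention split (NOT the paper's fused
# kernel). Block size, thresholds, the elu feature map, and the additive branch
# combination are assumptions for this demo.
import torch
import torch.nn.functional as F

def sla_attention_sketch(q, k, v, block=64, crit_frac=0.05, negl_frac=0.50):
    """q, k, v: (seq, dim) with seq divisible by `block`. Returns (seq, dim)."""
    n, d = q.shape
    scale = d ** -0.5

    # 1) Estimate block-level importance from mean-pooled queries and keys.
    qb = q.view(n // block, block, d).mean(1)            # (Nb, d)
    kb = k.view(n // block, block, d).mean(1)            # (Nb, d)
    score = (qb @ kb.T) * scale                          # (Nb, Nb)

    # 2) Classify block pairs: top crit_frac -> critical (exact attention),
    #    bottom negl_frac -> negligible (skipped), the rest -> marginal (linear).
    flat = score.flatten()
    hi = torch.quantile(flat, 1.0 - crit_frac)
    lo = torch.quantile(flat, negl_frac)
    critical = score >= hi
    marginal = (score < hi) & (score >= lo)

    # 3) Exact softmax attention restricted to critical blocks
    #    (a dense mask is used here for clarity, not efficiency).
    cmask = critical.repeat_interleave(block, 0).repeat_interleave(block, 1)
    logits = (q @ k.T) * scale
    logits = logits.masked_fill(~cmask, float("-inf"))
    sparse_out = torch.nan_to_num(F.softmax(logits, dim=-1)) @ v

    # 4) Linear attention on marginal blocks: with a feature map phi, the
    #    product phi(q) (phi(k)^T v) can be computed in O(N); the explicit
    #    mask below only exists to keep the sketch readable.
    phi = lambda x: F.elu(x) + 1.0
    mmask = marginal.repeat_interleave(block, 0).repeat_interleave(block, 1)
    weights = (phi(q) @ phi(k).T) * mmask
    linear_out = (weights @ v) / weights.sum(-1, keepdim=True).clamp_min(1e-6)

    # 5) Negligible blocks contribute nothing; combine the two branches.
    return sparse_out + linear_out

out = sla_attention_sketch(torch.randn(256, 64), torch.randn(256, 64), torch.randn(256, 64))
print(out.shape)  # torch.Size([256, 64])
```

In the actual method described above, these computations are fused into a single trainable GPU kernel with both forward and backward passes, so the masks and dense score matrices in this sketch would never be materialized.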