SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention

Abstract
In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.
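To make the critical/marginal/negligible split concrete, the sketch below illustrates the general idea in plain PyTorch: block-level importance scores route block pairs either to exact O(N^2) attention, to a kernel-feature linear-attention approximation, or to being skipped. This is only an illustrative sketch, not the paper's fused GPU kernel; the block size, quantile thresholds, elu-based feature map, and the simple additive combination of the two branches are all assumptions made for this demo.

```python
# Illustrative sketch of a sparse-linear attention split (NOT the paper's fused
# kernel). Block size, thresholds, the elu feature map, and the additive branch
# combination are assumptions for this demo.
import torch
import torch.nn.functional as F

def sla_attention_sketch(q, k, v, block=64, crit_frac=0.05, negl_frac=0.50):
    """q, k, v: (seq, dim) with seq divisible by `block`. Returns (seq, dim)."""
    n, d = q.shape
    scale = d ** -0.5

    # 1) Estimate block-level importance from mean-pooled queries and keys.
    qb = q.view(n // block, block, d).mean(1)            # (Nb, d)
    kb = k.view(n // block, block, d).mean(1)            # (Nb, d)
    score = (qb @ kb.T) * scale                          # (Nb, Nb)

    # 2) Classify block pairs: top crit_frac -> critical (exact attention),
    #    bottom negl_frac -> negligible (skipped), the rest -> marginal (linear).
    flat = score.flatten()
    hi = torch.quantile(flat, 1.0 - crit_frac)
    lo = torch.quantile(flat, negl_frac)
    critical = score >= hi
    marginal = (score < hi) & (score >= lo)

    # 3) Exact softmax attention restricted to critical blocks
    #    (a dense mask is used here for clarity, not efficiency).
    cmask = critical.repeat_interleave(block, 0).repeat_interleave(block, 1)
    logits = (q @ k.T) * scale
    logits = logits.masked_fill(~cmask, float("-inf"))
    sparse_out = torch.nan_to_num(F.softmax(logits, dim=-1)) @ v

    # 4) Linear attention on marginal blocks: with a feature map phi, the
    #    product phi(q) (phi(k)^T v) can be computed in O(N); the explicit
    #    mask below only exists to keep the sketch readable.
    phi = lambda x: F.elu(x) + 1.0
    mmask = marginal.repeat_interleave(block, 0).repeat_interleave(block, 1)
    weights = (phi(q) @ phi(k).T) * mmask
    linear_out = (weights @ v) / weights.sum(-1, keepdim=True).clamp_min(1e-6)

    # 5) Negligible blocks contribute nothing; combine the two branches.
    return sparse_out + linear_out

out = sla_attention_sketch(torch.randn(256, 64), torch.randn(256, 64), torch.randn(256, 64))
print(out.shape)  # torch.Size([256, 64])
```

In the actual method described above, these computations are fused into a single trainable GPU kernel with both forward and backward passes, so the masks and dense score matrices in this sketch would never be materialized.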