
SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training

Zhang, Jintao ; Wei, Jia ; Zhang, Pengle ; Xu, Xiaoming ; Huang, Haofeng ; Wang, Haoxu ; Jiang, Kai ; Zhu, Jun ; Chen, Jianfei
Publication date: 5/21/2025
Abstract

The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions. First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on the RTX 5090, a 5x speedup over the fastest FlashAttention on the RTX 5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer the application of low-bit attention to training tasks. Existing low-bit attention works, such as FlashAttention3 and SageAttention, focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code will be available at https://github.com/thu-ml/SageAttention.
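
To make the microscaling idea in the title concrete, below is a minimal NumPy sketch of blockwise FP4 (E2M1) quantization, where each block of values shares one scale factor chosen so the largest magnitude maps onto the FP4 maximum. This is only an illustrative simulation under assumed choices (block size 16, round-to-nearest, float scales), not the SageAttention3 CUDA kernels, which run natively on Blackwell FP4 Tensor Cores.

```python
# Illustrative simulation of microscaling FP4 (E2M1) quantization.
# Assumptions (not from the paper): block size 16, round-to-nearest,
# scales kept as ordinary floats, values dequantized back for inspection.
import numpy as np

# All non-negative magnitudes representable in FP4 E2M1 (sign handled separately).
FP4_E2M1_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_microscaling(x: np.ndarray, block: int = 16):
    """Quantize the last dimension of `x` in blocks of `block` elements.

    Returns the quantized values (already dequantized to floats for clarity)
    and the per-block scales used to reconstruct an approximation of x.
    """
    *lead, n = x.shape
    assert n % block == 0, "last dim must be a multiple of the block size"
    xb = x.reshape(*lead, n // block, block)

    # One scale per block so the block's largest magnitude maps to 6.0 (FP4 max).
    scale = np.abs(xb).max(axis=-1, keepdims=True) / FP4_E2M1_LEVELS[-1]
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero

    # Round each scaled magnitude to the nearest representable FP4 level.
    mag = np.abs(xb) / scale
    idx = np.abs(mag[..., None] - FP4_E2M1_LEVELS).argmin(axis=-1)
    q = np.sign(xb) * FP4_E2M1_LEVELS[idx]

    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray, shape):
    return (q * scale).reshape(shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = rng.random((4, 64)).astype(np.float32)  # e.g. softmax scores in [0, 1]
    q, s = quantize_fp4_microscaling(p)
    p_hat = dequantize(q, s, p.shape)
    print("max abs error:", np.abs(p - p_hat).max())
```

The fine-grained per-block scales are what keep the 4-bit representation accurate: outliers only affect the quantization of their own block rather than an entire row or tensor, which is why the paper's FP4 attention can be dropped into inference without retraining.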