
SageAttention2++: A More Efficient Implementation of SageAttention2

Zhang, Jintao; Xu, Xiaoming; Wei, Jia; Huang, Haofeng; Zhang, Pengle; Xiang, Chendong; Zhu, Jun; Chen, Jianfei
Published: 5/29/2025
Abstract

The efficiency of attention is critical because its time complexity grows quadratically with sequence length. SageAttention2 addresses this by utilizing quantization to accelerate the matrix multiplications (Matmul) in attention. To further accelerate SageAttention2, we propose utilizing the faster FP8 Matmul instruction that accumulates in FP16. This instruction is 2x faster than the FP8 Matmul used in SageAttention2. Our experiments show that SageAttention2++ achieves a 3.9x speedup over FlashAttention while maintaining the same attention accuracy as SageAttention2. This means SageAttention2++ effectively accelerates various models, including those for language, image, and video generation, with negligible end-to-end metric loss. The code will be available at https://github.com/thu-ml/SageAttention.
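To make the core idea concrete, the sketch below emulates an FP8 Matmul whose products are accumulated in FP16, applied to one of the matmuls inside attention. This is not the official SageAttention2++ kernel: the real speedup comes from issuing an FP8 tensor-core MMA instruction with an FP16 accumulator on the GPU, whereas this sketch only mimics the numerics in plain PyTorch so the quantization error can be inspected anywhere. All function and variable names are illustrative assumptions (not the paper's API), and it assumes PyTorch >= 2.1 for the `torch.float8_e4m3fn` dtype.

```python
# Minimal sketch (assumption, not the paper's implementation) of FP8-quantized
# matmul with FP16 accumulation, emulated in PyTorch for clarity.
import torch


def quantize_fp8(x: torch.Tensor):
    """Per-tensor symmetric quantization to e4m3: returns (fp8 tensor, scale)."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3
    scale = x.abs().amax().clamp(min=1e-12) / fp8_max
    return (x / scale).to(torch.float8_e4m3fn), scale


def fp8_matmul_fp16_acc(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Emulate an FP8 x FP8 matmul whose partial products accumulate in FP16."""
    a_q, a_scale = quantize_fp8(a)
    b_q, b_scale = quantize_fp8(b)
    # Multiply-accumulate on FP16 values, mirroring an FP16-accumulator MMA;
    # the per-tensor scales are applied once at the end.
    out = a_q.to(torch.float16) @ b_q.to(torch.float16)
    return out * (a_scale * b_scale)


if __name__ == "__main__":
    # Illustrative shapes only: e.g. one block of attention probabilities times values.
    p = torch.rand(1024, 1024, dtype=torch.float16)
    v = torch.randn(1024, 64, dtype=torch.float16)
    ref = p @ v
    approx = fp8_matmul_fp16_acc(p, v)
    print("max abs error:", (ref - approx).abs().max().item())
```

On the hardware path, swapping an FP32 accumulator for an FP16 one in the FP8 MMA is what yields the roughly 2x instruction-level speedup the abstract cites; the emulation above only shows that the quantization itself keeps the result close to the FP16 reference.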