SeerAttention-R: Sparse Attention Adaptation for Long Reasoning

Gao, Yizhao ; Guo, Shuming ; Cao, Shijie ; Xia, Yuqing ; Cheng, Yu ; Wang, Lei ; Ma, Lingxiao ; Sun, Yutao ; Ye, Tianzhu ; Dong, Li ; So, Hayden Kwok-Hay ; Hua, Yu ; Cao, Ting ; Yang, Fan ; Yang, Mao
Published: 6/12/2025
Abstract

We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With a lightweight plug-in gating, SeerAttention-R is flexible and can be easily integrated into existing pretrained models without modifying the original parameters. We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy with a 4K token budget on the AIME benchmark under large sparse attention block sizes (64/128). Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on an H100 GPU at 90% sparsity. Code is available at: https://github.com/microsoft/SeerAttention.
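
To make the gating idea concrete, the following minimal PyTorch sketch shows how a plug-in gate could select KV-cache blocks for a single decoding query under a token budget. The class name, the linear gate parameterization, and the mean pooling of keys are illustrative assumptions, not the paper's implementation: SeerAttention-R learns its gate via self-distillation from the full attention map and executes the sparse attention with an optimized TileLang kernel.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockSparseDecodeAttention(nn.Module):
    """Illustrative block-sparse decode attention with a plug-in gate.

    For one auto-regressive decoding step, keys are pooled into block-level
    representations, a small learned gate scores each block against the
    query, and dense attention runs only over the top-scoring blocks that
    fit in the token budget. The pretrained model weights are untouched;
    only the gate parameters are new."""

    def __init__(self, dim: int, block_size: int = 64, token_budget: int = 4096):
        super().__init__()
        self.block_size = block_size
        self.token_budget = token_budget
        # Hypothetical gate parameterization; the paper trains its gate by
        # self-distillation rather than this ad hoc linear scoring.
        self.gate_q = nn.Linear(dim, dim, bias=False)
        self.gate_k = nn.Linear(dim, dim, bias=False)

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q: (1, dim) current decoding query; k, v: (seq_len, dim) KV cache.
        seq_len, dim = k.shape
        bs = self.block_size
        n_blocks = (seq_len + bs - 1) // bs

        # Mean-pool keys per block. There is no query pooling: each decode
        # step has a single query token. (Zero padding slightly biases the
        # last block's mean; acceptable for a sketch.)
        pad = n_blocks * bs - seq_len
        k_blocks = F.pad(k, (0, 0, 0, pad)).view(n_blocks, bs, dim).mean(dim=1)

        # Gate scores decide which KV blocks this query attends to.
        scores = (self.gate_q(q) @ self.gate_k(k_blocks).T) / dim**0.5
        keep = min(n_blocks, max(1, self.token_budget // bs))
        top_blocks = scores.topk(keep, dim=-1).indices.squeeze(0)

        # Expand block indices to token indices, dropping padded positions.
        idx = (top_blocks[:, None] * bs
               + torch.arange(bs, device=k.device)).flatten()
        idx = idx[idx < seq_len]

        # Dense attention restricted to the selected tokens.
        attn = ((q @ k[idx].T) / dim**0.5).softmax(dim=-1)
        return attn @ v[idx]


# Usage: one decode step over a 16K-token KV cache with a 4K token budget.
gate_attn = BlockSparseDecodeAttention(dim=128, block_size=64, token_budget=4096)
q = torch.randn(1, 128)
k, v = torch.randn(16384, 128), torch.randn(16384, 128)
out = gate_attn(q, k, v)  # (1, 128); attends to at most 4096 cached tokens

Because the gate operates on pooled block representations rather than individual tokens, its cost is a small fraction of full attention, which is what makes the large block sizes (64/128) and the reported decode-time speedups plausible.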