SeerAttention-R: Sparse Attention Adaptation for Long Reasoning

We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With a lightweight plug-in gating, SeerAttention-R is flexible and can be easily integrated into existing pretrained models without modifying the original parameters. We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy with a 4K token budget on the AIME benchmark under large sparse attention block sizes (64/128). Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on an H100 GPU at 90% sparsity. Code is available at: https://github.com/microsoft/SeerAttention.
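
Below is a minimal PyTorch sketch, not the authors' implementation, of the general idea the abstract describes: during auto-regressive decoding, a lightweight gate scores blocks of the key cache against the current query, and only the selected blocks participate in attention. The function name, the mean-pooling of keys as the gate input, and parameters such as `block_size` and `budget_blocks` are illustrative assumptions rather than details taken from the paper or the released code.

```python
import math
import torch
import torch.nn.functional as F

def block_sparse_decode_attention(q, k_cache, v_cache, block_size=64, budget_blocks=4):
    """Illustrative block-sparse attention for a single decoding step.
    q: (heads, dim) query for the current token.
    k_cache, v_cache: (heads, seq_len, dim) cached keys/values."""
    heads, seq_len, dim = k_cache.shape
    num_blocks = math.ceil(seq_len / block_size)

    # Pad the key cache so it splits evenly into blocks.
    pad = num_blocks * block_size - seq_len
    k_pad = F.pad(k_cache, (0, 0, 0, pad))

    # Gate: score each key block by its mean-pooled key (an assumed pooling choice).
    k_blocks = k_pad.view(heads, num_blocks, block_size, dim).mean(dim=2)
    gate_scores = torch.einsum("hd,hbd->hb", q, k_blocks)

    # Keep only the top-scoring blocks within the token budget.
    keep = min(budget_blocks, num_blocks)
    top_blocks = gate_scores.topk(keep, dim=-1).indices  # (heads, keep)

    out = torch.zeros(heads, dim)
    for h in range(heads):
        # Expand selected block indices into token indices, clipped to the true length.
        idx = (top_blocks[h].unsqueeze(-1) * block_size
               + torch.arange(block_size)).flatten()
        idx = idx[idx < seq_len]
        k_sel, v_sel = k_cache[h, idx], v_cache[h, idx]
        # Dense attention restricted to the selected blocks.
        attn = torch.softmax(q[h] @ k_sel.T / math.sqrt(dim), dim=-1)
        out[h] = attn @ v_sel
    return out

# Example: one decoding step over a 1K-token cache with 8 heads.
q = torch.randn(8, 128)
k = torch.randn(8, 1024, 128)
v = torch.randn(8, 1024, 128)
print(block_sparse_decode_attention(q, k, v).shape)  # torch.Size([8, 128])
```

In this sketch the gate is a parameter-free pooling heuristic; in SeerAttention-R the gate is a learned, self-distilled module trained to match the dense attention distribution, and the selected blocks are consumed by an optimized TileLang kernel rather than the per-head Python loop shown here.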