7 days ago

Abstract

Extended reasoning in Large Language Models (LLMs) creates a severe KV cache memory bottleneck. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, because queries rotate with position under RoPE, very few queries are representative, which degrades top-key selection and makes reasoning unstable. To avoid this problem, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions (Q/K concentration). We show that this concentration causes queries to attend preferentially to keys at specific distances (e.g., the nearest keys), and that the centers determine which distances are preferred via a trigonometric series. Based on this, we propose TriAttention, which uses these centers to estimate key importance. TriAttention uses the trigonometric series to score keys by position according to the distance preference characterized by the centers, and additionally uses Q/K norms as a further signal for importance estimation. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines reach only about half the accuracy at the same efficiency. TriAttention also enables OpenClaw deployment on a single consumer GPU for long contexts that would run out of memory with Full Attention.

One-sentence Summary

To address the instability of post-RoPE importance estimation, researchers propose TriAttention, a KV cache compression method that leverages Q/K concentration in the pre-RoPE space and uses a trigonometric series to model distance-based attention preferences, matching Full Attention accuracy on AIME25 with 32K-token generation while achieving 2.5x higher throughput or 10.7x KV memory reduction.

Key Contributions

The paper identifies a phenomenon called Q/K concentration in the pre-RoPE space, where query and key vectors cluster around stable, non-zero centers regardless of position.
This work introduces TriAttention, a method that estimates key importance by using a trigonometric series to characterize distance preferences based on these stable centers and incorporating Q/K norms as an additional signal.
Experimental results on benchmarks such as AIME25 demonstrate that TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines reach only about half the accuracy at the same efficiency.

Introduction

As large language models engage in extended reasoning, the growing KV cache creates significant memory bottlenecks that hinder long-context performance. Existing compression methods typically estimate token importance in the post-RoPE space, but these approaches struggle because positional rotations cause query vectors to shift constantly. This rotation makes it difficult to identify representative queries and leads to unstable importance estimation or the loss of critical directional information. The authors leverage a discovered property called Q/K concentration in the pre-RoPE space, where vectors cluster around stable, fixed centers. By using a trigonometric series to model how these centers determine distance-based attention preferences, they propose TriAttention to estimate key importance more reliably and efficiently.

Method

The method builds on Rotary Position Embedding (RoPE), which encodes positional information through rotations in vector space. RoPE divides a $d$-dimensional vector into $d/2$ two-dimensional subspaces, each associated with a frequency band $f$. For each band, a rotation by angle $\omega_f p$ is applied at position $p$, where $\omega_f = \theta^{-2f/d}$ and $\theta = 10000$ is a fixed constant. This rotation is a linear transformation of the vector components $(x_{2f}, x_{2f+1})$ and yields the post-RoPE vectors. The authors observe that queries and keys in the pre-RoPE space are highly concentrated around non-zero centers, a phenomenon that is consistent across positions and contexts, as illustrated by the distribution plots in the figure below.
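As a minimal sketch of the rotation described above, the following NumPy code applies RoPE to interleaved pairs $(x_{2f}, x_{2f+1})$. The function names `rope_frequencies` and `apply_rope` are illustrative and are not taken from the paper's code. The example also checks the standard RoPE property that the post-RoPE dot product depends only on the relative position of query and key.

```python
import numpy as np

def rope_frequencies(d, theta=10000.0):
    """Per-band angular frequencies omega_f = theta^(-2f/d), f = 0..d/2-1."""
    f = np.arange(d // 2)
    return theta ** (-2.0 * f / d)

def apply_rope(x, p, theta=10000.0):
    """Rotate each pair (x[2f], x[2f+1]) by the angle omega_f * p."""
    x = np.asarray(x, dtype=float)
    angle = rope_frequencies(x.shape[-1], theta) * p
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative position (3) at two different absolute offsets.
logit_near = apply_rope(q, 10) @ apply_rope(k, 7)
logit_far = apply_rope(q, 1010) @ apply_rope(k, 1007)
print(np.isclose(logit_near, logit_far))
```

The two logits should agree up to floating-point error, while each rotated query on its own differs between positions 10 and 1010, which is why post-RoPE queries from one position are poor stand-ins for queries at another.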

Pre-RoPE distribution of query and key vectors

This concentration is quantified using the Mean Resultant Length $R = \|\mathbb{E}[q]\| / \mathbb{E}[\|q\|]$, where values approaching 1 indicate strong directional concentration. The authors show that this concentration enables the attention computation to be approximated by a trigonometric series. When query and key vectors are approximately constant, the attention logit simplifies to a sum of cosine and sine terms in the relative position $\Delta = p_q - p_k$, forming a trigonometric series with coefficients determined by the magnitudes and phases of the vectors.
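The two quantities in this paragraph can be sketched numerically. The code below computes the Mean Resultant Length for a concentrated and an isotropic point cloud, and checks that the post-RoPE logit of a fixed query/key pair equals $\sum_f \|q_f\| \|k_f\| \cos(\omega_f \Delta + \phi_f)$ with $\phi_f = \arg q_f - \arg k_f$. The helper names and the complex-number view of each band are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mean_resultant_length(vectors):
    """R = ||E[q]|| / E[||q||] for an array of shape (n, dim)."""
    vectors = np.asarray(vectors, dtype=float)
    return np.linalg.norm(vectors.mean(axis=0)) / np.linalg.norm(vectors, axis=1).mean()

def to_complex(x):
    """View each pair (x[2f], x[2f+1]) as one complex number per band."""
    x = np.asarray(x, dtype=float)
    return x[..., 0::2] + 1j * x[..., 1::2]

def rope_logit(q, k, p_q, p_k, theta=10000.0):
    """Exact post-RoPE dot product of q at position p_q and k at position p_k."""
    omega = theta ** (-2.0 * np.arange(len(q) // 2) / len(q))
    zq = to_complex(q) * np.exp(1j * omega * p_q)
    zk = to_complex(k) * np.exp(1j * omega * p_k)
    return float(np.real(np.sum(zq * np.conj(zk))))

def trig_series_logit(q, k, delta, theta=10000.0):
    """sum_f |q_f| |k_f| cos(omega_f * delta + phi_f), phi_f = arg(q_f) - arg(k_f)."""
    omega = theta ** (-2.0 * np.arange(len(q) // 2) / len(q))
    zq, zk = to_complex(q), to_complex(k)
    phase = np.angle(zq) - np.angle(zk)
    return float(np.sum(np.abs(zq) * np.abs(zk) * np.cos(omega * delta + phase)))

rng = np.random.default_rng(0)
center = np.array([3.0, 4.0])
concentrated = center + 0.1 * rng.normal(size=(1000, 2))  # tight cluster, R near 1
isotropic = rng.normal(size=(1000, 2))                    # zero-mean cloud, R near 0
print(mean_resultant_length(concentrated), mean_resultant_length(isotropic))

q, k = rng.normal(size=8), rng.normal(size=8)
print(rope_logit(q, k, p_q=50, p_k=20), trig_series_logit(q, k, delta=30))
```

Expanding each cosine with the angle-addition formula gives the cosine and sine terms in $\Delta$ mentioned above, with coefficients $\|q_f\|\|k_f\|\cos\phi_f$ and $-\|q_f\|\|k_f\|\sin\phi_f$.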

Based on this analysis, the authors propose TriAttention, a KV cache compression method that scores key importance for pruning. The scoring function combines two components: a trigonometric series score $S_{\text{trig}}$ and a norm-based score $S_{\text{norm}}$. The trigonometric series score estimates attention based on distance preference, using the expected query center $\mathbb{E}[q_f]$ as a proxy for future queries and computing a cosine similarity weighted by vector magnitudes and phase differences. The norm-based score accounts for variations around the center by using the expected query norm $\mathbb{E}[\|q_f\|]$ and key magnitude $\|k_f\|$. The final combined score is $S(k, \Delta) = S_{\text{trig}}(k, \Delta) + S_{\text{norm}}(k)$.
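The summary does not give the exact formulas for the two scores, so the following is only a sketch under stated assumptions: $S_{\text{trig}}$ is taken as $\sum_f \|\mathbb{E}[q_f]\| \|k_f\| \cos(\omega_f \Delta + \phi_f)$ and $S_{\text{norm}}$ as $\sum_f \mathbb{E}[\|q_f\|] \|k_f\|$. All function names, and the random stand-ins for the calibrated statistics, are hypothetical.

```python
import numpy as np

def band_complex(x):
    """View each pair (x[2f], x[2f+1]) as one complex number per band."""
    x = np.asarray(x, dtype=float)
    return x[..., 0::2] + 1j * x[..., 1::2]

def rope_omega(d, theta=10000.0):
    """omega_f = theta^(-2f/d) for f = 0..d/2-1."""
    return theta ** (-2.0 * np.arange(d // 2) / d)

def s_trig(q_center, k, delta, theta=10000.0):
    """Assumed form: sum_f |E[q_f]| |k_f| cos(omega_f * delta + phi_f)."""
    zq, zk = band_complex(q_center), band_complex(k)
    omega = rope_omega(np.shape(k)[-1], theta)
    delta = np.asarray(delta, dtype=float)[..., None]  # broadcast over bands
    phase = np.angle(zq) - np.angle(zk)
    return np.sum(np.abs(zq) * np.abs(zk) * np.cos(omega * delta + phase), axis=-1)

def s_norm(q_norm_mean, k):
    """Assumed form: sum_f E[|q_f|] |k_f|; q_norm_mean has shape (d/2,)."""
    return np.sum(q_norm_mean * np.abs(band_complex(k)), axis=-1)

def combined_score(q_center, q_norm_mean, k, delta):
    """S(k, delta) = S_trig(k, delta) + S_norm(k)."""
    return s_trig(q_center, k, delta) + s_norm(q_norm_mean, k)

rng = np.random.default_rng(0)
q_center = rng.normal(size=8)                       # stands in for calibrated E[q]
q_norm_mean = 1.1 * np.abs(band_complex(q_center))  # stands in for E[|q_f|]
keys = rng.normal(size=(5, 8))                      # five cached pre-RoPE keys
delta = np.array([1, 2, 4, 8, 16])                  # query position minus key position
scores = combined_score(q_center, q_norm_mean, keys, delta)
print(scores)
```

Because the query statistics are fixed after calibration, scoring a key needs only the key itself and its distance to the assumed query position, not any observed attention weights.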

To adapt the scoring to varying levels of concentration, the authors introduce an adaptive weighting mechanism. The Mean Resultant Length $R_f$ for each frequency band $f$ is used to scale the norm-based score. When $R_f$ is high (strong concentration), the contribution of $S_{\text{norm}}$ is reduced, emphasizing the trigonometric series. When $R_f$ is low (weak concentration), the full norm contribution is preserved. The final score $S_{\text{final}}(k)$ is derived by averaging the score over multiple future query positions and applying a normalize-then-aggregate strategy for Grouped-Query Attention, where scores from different query heads are z-score normalized and combined via a maximum operation.
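A rough sketch of these three steps follows. The weighting $1 - R_f$, the geometric offset schedule, and every function name are assumptions chosen to match the description, since the summary does not state the exact formulas; the per-head score functions are toy placeholders.

```python
import numpy as np

def adaptive_norm_weight(r_f):
    """Assumed weighting: shrink the norm term as concentration R_f approaches 1."""
    return 1.0 - np.clip(r_f, 0.0, 1.0)

def geometric_offsets(max_distance, num_offsets):
    """Geometrically spaced future offsets between 1 and max_distance."""
    raw = np.geomspace(1, max_distance, num_offsets)
    return np.unique(np.round(raw).astype(int))

def average_over_offsets(score_fn, key_positions, current_pos, offsets):
    """Average score_fn(delta) over hypothetical future query positions."""
    per_offset = [score_fn(current_pos + o - key_positions) for o in offsets]
    return np.mean(per_offset, axis=0)

def aggregate_gqa(head_scores, eps=1e-6):
    """Z-score each query head's scores over keys, then take the max over heads."""
    head_scores = np.asarray(head_scores, dtype=float)
    mean = head_scores.mean(axis=1, keepdims=True)
    std = head_scores.std(axis=1, keepdims=True)
    return ((head_scores - mean) / (std + eps)).max(axis=0)

offsets = geometric_offsets(max_distance=64, num_offsets=4)
key_positions = np.arange(10)

# Toy per-head score functions of the distance delta (placeholders only).
head_fns = [lambda d: np.cos(0.1 * d), lambda d: 1.0 / (1.0 + d)]
head_scores = np.stack([
    average_over_offsets(fn, key_positions, current_pos=9, offsets=offsets)
    for fn in head_fns
])
final = aggregate_gqa(head_scores)
print(offsets, final.shape)
```

Normalizing per head before the maximum keeps a head with a large raw score scale from dominating the shared key ranking, so a key survives if any query head in the group rates it highly relative to that head's other keys.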

The method is implemented with window-based pruning, where key scoring and pruning are triggered every 128 generated tokens to reduce computational overhead. Keys are ranked by their final score, and the top-$B$ keys are kept in the KV cache. The overall framework, as illustrated in the figure below, begins with offline calibration to compute query and key distribution centers, followed by the scoring process during inference, and concludes with the retention of top-scoring keys to produce a pruned attention map.
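The pruning schedule can be sketched as a small simulation that tracks only which token positions remain cached. The function names and the recency-based `score_fn` are hypothetical stand-ins for the real scoring step; only the window of 128 tokens and the top-$B$ rule come from the description above.

```python
import numpy as np

def prune_top_b(scores, budget):
    """Return the sorted indices of the `budget` highest-scoring keys."""
    if len(scores) <= budget:
        return np.arange(len(scores))
    keep = np.argpartition(scores, -budget)[-budget:]
    return np.sort(keep)

def generate_with_pruning(num_steps, budget, score_fn, window=128):
    """Simulate decoding and return the token positions left in the cache."""
    cache_positions = []
    for step in range(num_steps):
        cache_positions.append(step)  # the new token's key enters the cache
        # Score and prune only once per window to amortize the cost.
        if (step + 1) % window == 0 and len(cache_positions) > budget:
            pos = np.array(cache_positions)
            keep = prune_top_b(score_fn(pos, step), budget)
            cache_positions = pos[keep].tolist()
    return cache_positions

# Placeholder scorer that simply prefers recent keys.
recency = lambda pos, step: pos.astype(float)
remaining = generate_with_pruning(num_steps=300, budget=100, score_fn=recency)
print(len(remaining), remaining[0])
```

In this sketch the cache is cut back to the budget only at window boundaries, so between two pruning steps it can temporarily hold up to the budget plus one window of additional keys.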

The effectiveness of this approach is demonstrated by its ability to maintain correct memory retention in recursive tasks, where losing intermediate states leads to error propagation and a corrupted final result, as shown in the figure below.

Memory retention in recursive simulation

Experiment

The researchers evaluate TriAttention, a KV cache compression method, by testing its ability to reconstruct attention patterns through trigonometric series and its performance on mathematical reasoning, retrieval, and agentic tasks. Experiments across various architectures and benchmarks demonstrate that TriAttention effectively preserves essential information for long-chain reasoning and memory retention while significantly improving throughput and reducing memory footprints. The results show that the method maintains accuracy comparable to full attention even under aggressive compression, outperforming existing observation-based pruning baselines.

TriAttention delivers substantially higher throughput than Full Attention while maintaining comparable accuracy across multiple benchmarks. The speedup is largest on MATH 500, where it reaches up to 6.3× higher throughput than Full Attention. On AIME24 and AIME25, TriAttention matches Full Attention accuracy with a significantly smaller KV budget, and the reduced KV cache memory allows long-context reasoning tasks to complete within limited GPU memory.

Across all tested KV cache budgets, TriAttention achieves the highest accuracy among the compared compression methods, outperforming baselines including H2O, TOVA, and RaaS. It matches or exceeds FullKV performance at lower memory usage.

An ablation evaluates how the range and spacing of future offsets affect accuracy. Increasing the maximum distance improves accuracy, geometric spacing outperforms linear spacing, and accuracy varies with the number of offsets.

The table compares performance on the AIME24 and AIME25 benchmarks between coding and reasoning tasks. Reasoning tasks score lower than coding tasks on both benchmarks, indicating a gap in model capabilities across domains. Performance is consistently higher on AIME24 than on AIME25 for both coding and reasoning tasks, and the gap between coding and reasoning is larger on AIME25 than on AIME24.

AIME performance on reasoning and coding

[[IMG:http://api-rsrc.hyper.ai/2604.04921/41063cc6-280f-4700-a63f-3eba5c13a885/tex_resource/extracted_tables/table-4.png|]]

On the RULER benchmark, TriAttention achieves the highest average score among all compared methods and significantly outperforms SnapKV and PyramidKV, demonstrating strong retrieval capability.

TriAttention outperforms baselines on RULER

TriAttention is evaluated across various benchmarks to validate its throughput, memory efficiency, and retrieval capabilities compared to full attention and existing compression methods. The results demonstrate that TriAttention significantly improves throughput and reduces KV cache requirements while maintaining or exceeding the accuracy of baseline models. Furthermore, ablation studies indicate that performance is optimized through specific offset ranges and geometric spacing strategies.

Weian Mao Xi Lin Wei Huang Yuxin Xie Tianfu Fu Bohan Zhuang Song Han Yukang Chen

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
