5 hours ago


Abstract

Long-context agentic workflows have become a central use case for large language models, making attention efficiency a decisive factor in both inference speed and serving cost. Sparse attention is an effective answer to this challenge, and DeepSeek Sparse Attention (DSA) is a prominent production-grade example: a lightweight lightning indexer selects the top-k most relevant tokens for each query, reducing core attention complexity from $O(L^2)$ to $O(Lk)$. However, the indexer itself retains $O(L^2)$ complexity and must run independently at every layer, even though the resulting top-k selections are highly similar across consecutive layers.

In this work, we present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the top-k indices of the nearest Full layer. We propose two complementary approaches for determining and optimizing these configurations. Training-free IndexCache applies a greedy search algorithm that selects which layers retain their indexers by directly minimizing language modeling loss on a calibration set, with no weight updates required. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all the layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy.

Experimental results on a 30B-parameter DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82x prefill speedup and up to 1.48x decode speedup compared with standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).

One-sentence Summary

Researchers from Tsinghua University and Z.ai introduce IndexCache, a technique that optimizes DeepSeek Sparse Attention by exploiting cross-layer redundancy to share top-k token indices. This approach eliminates up to 75% of indexer computations in long-context workflows, delivering significant inference speedups with negligible quality degradation, either without any weight updates (training-free) or with a training-aware distillation variant.

Key Contributions

Long-context agentic workflows rely on DeepSeek Sparse Attention to reduce core attention complexity, yet the required lightning indexer still incurs quadratic $O(L^2)$ cost at every layer, creating a significant bottleneck for inference speed and serving costs.
IndexCache exploits the redundancy of top-k selections across consecutive layers by partitioning layers into Full layers that compute indices and Shared layers that reuse the nearest Full layer's top-k selections, using either a training-free greedy search or a training-aware multi-layer distillation loss to optimize the configuration.
Experiments on a 30B DSA model demonstrate that IndexCache removes 75% of indexer computations with negligible quality degradation, achieving up to 1.82x prefill and 1.48x decode speedups while maintaining performance across nine long-context and reasoning benchmarks.

Introduction

Large language models face a critical bottleneck in long-context inference due to the quadratic complexity of self-attention, which sparse mechanisms like DeepSeek Sparse Attention (DSA) address by selecting only the most relevant tokens. While DSA reduces core attention costs, its reliance on a lightweight indexer at every layer still incurs quadratic overhead that dominates latency during the prefill stage. The authors leverage the observation that token selection patterns remain highly stable across consecutive layers to introduce IndexCache, a method that eliminates up to 75% of indexer computations by reusing indices from a small subset of retained layers. They propose both a training-free approach using greedy layer selection and a training-aware strategy with multi-layer distillation to maintain model quality while achieving significant speedups in long-context scenarios.

Top Figure

Method

The authors leverage the observation that sparse attention indexers exhibit significant redundancy across consecutive layers to reduce computational overhead. In standard DeepSeek Sparse Attention, a lightweight lightning indexer scores all preceding tokens at every layer to select the top-k positions. While this reduces core attention complexity from $O(L^2)$ to $O(Lk)$ , the indexer itself retains $O(L^2)$ complexity. IndexCache addresses this by partitioning the $N$ transformer layers into two categories: Full layers and Shared layers. Full layers retain their indexers to compute fresh top-k sets, while Shared layers skip the indexer forward pass and reuse the index set from the nearest preceding Full layer. This design allows the system to eliminate a large fraction of the total indexer cost with minimal architectural changes.
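The Full/Shared partition described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, scores are plain lists rather than tensors, and the requirement that layer 0 be a Full layer is an assumption made here because a Shared layer needs a preceding Full layer to reuse.

```python
from typing import List, Sequence


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the positions of the k largest scores, sorted by position."""
    k = min(k, len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def index_source_layers(is_full: Sequence[bool]) -> List[int]:
    """For each layer, return the layer whose top-k set it uses.

    A Full layer uses its own indexer; a Shared layer reuses the set from
    the nearest preceding Full layer. Assumption: layer 0 must be Full.
    """
    if not is_full or not is_full[0]:
        raise ValueError("layer 0 must be Full: it has no preceding layer to reuse")
    sources: List[int] = []
    last_full = 0
    for layer, full in enumerate(is_full):
        if full:
            last_full = layer
        sources.append(last_full)
    return sources


def indexer_fraction_removed(is_full: Sequence[bool]) -> float:
    """Fraction of per-layer indexer passes skipped under this pattern."""
    return 1.0 - sum(is_full) / len(is_full)


# Illustrative 8-layer pattern that keeps one indexer in every four layers.
pattern = [True, False, False, False, True, False, False, False]
print(top_k_indices([0.1, 0.9, 0.3, 0.7], k=2))  # [1, 3]
print(index_source_layers(pattern))              # [0, 0, 0, 0, 4, 4, 4, 4]
print(indexer_fraction_removed(pattern))         # 0.75
```

Because each skipped indexer pass is itself $O(L^2)$, keeping one indexer in four removes 75% of the indexer cost while the $O(Lk)$ core attention still runs at every layer.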

To determine the optimal configuration of Full and Shared layers without retraining, the authors propose a training-free greedy search algorithm. The process begins with all layers designated as Full. The algorithm iteratively evaluates the language modeling loss on a calibration set for each candidate layer conversion. At each step, the layer whose conversion to Shared status results in the lowest loss increase is selected. This data-driven approach identifies which indexers are expendable based on their intrinsic importance to the model's performance rather than relying on uniform interleaving patterns.
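The greedy procedure can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: `calib_loss` is a hypothetical callback that returns the calibration-set language modeling loss for a given set of Shared layers, and layer 0 is never converted because it has no preceding Full layer to reuse. Within one step every candidate starts from the same baseline, so picking the lowest resulting loss is the same as picking the lowest loss increase.

```python
from typing import Callable, FrozenSet, List, Set


def greedy_shared_search(
    num_layers: int,
    num_to_share: int,
    calib_loss: Callable[[FrozenSet[int]], float],
) -> List[int]:
    """Return the layers converted from Full to Shared, in selection order."""
    shared: Set[int] = set()
    order: List[int] = []
    for _ in range(num_to_share):
        # Layer 0 stays Full (assumption); already-Shared layers are skipped.
        candidates = [l for l in range(1, num_layers) if l not in shared]
        if not candidates:
            break
        # Evaluate each single conversion and keep the least harmful one.
        best = min(candidates, key=lambda l: calib_loss(frozenset(shared | {l})))
        shared.add(best)
        order.append(best)
    return order


# Toy stand-in for a real calibration run: each layer has a fixed penalty.
penalty = {1: 0.5, 2: 0.1, 3: 0.3}


def toy_loss(shared_layers: FrozenSet[int]) -> float:
    return 2.0 + sum(penalty[l] for l in shared_layers)


print(greedy_shared_search(num_layers=4, num_to_share=2, calib_loss=toy_loss))  # [2, 3]
```

In a real setting each `calib_loss` call is a forward pass over the calibration set, so the search costs on the order of `num_layers * num_to_share` evaluations but needs no weight updates.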

For models trained from scratch or via continued pre-training, a training-aware approach further optimizes the indexer parameters for cross-layer sharing. Standard training distills the indexer against the attention distribution of its own layer. IndexCache generalizes this by introducing a multi-layer distillation loss. This objective encourages the retained indexer to predict a top-k set that is jointly useful for itself and all subsequent Shared layers it serves. The loss function is defined as:

$$\mathcal{L}_{\mathrm{multi}}^{\mathrm{I}} = \sum_{j=0}^{m} \frac{1}{m+1} \sum_{t} D_{\mathrm{KL}}\Big(\mathbf{p}_t^{(\ell+j)} \,\big\|\, \mathbf{q}_t^{(\ell)}\Big),$$

where $\mathbf{p}_t^{(\ell+j)}$ represents the aggregated attention distribution at layer $\ell+j$ and $\mathbf{q}_t^{(\ell)}$ is the indexer's output distribution. Theoretical analysis shows that this multi-layer loss produces gradients equivalent to distilling against the averaged attention distribution of all served layers. This ensures the indexer learns a consensus top-k selection that covers important tokens across the entire group of layers.
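The gradient equivalence can be checked numerically on a toy case. The sketch below is an assumption-laden illustration: it handles a single token $t$, treats the attention targets as fixed (not differentiated through), uses plain Python lists for distributions, and the helper names are hypothetical. Expanding each KL term shows that the multi-layer loss and $D_{\mathrm{KL}}(\bar{\mathbf{p}} \,\|\, \mathbf{q})$, with $\bar{\mathbf{p}}$ the mean of the targets, differ only by entropy terms that do not involve $\mathbf{q}$, so their gradients with respect to the indexer output coincide.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(p || q) for discrete distributions; assumes q_i > 0 where p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def multi_layer_loss(targets: Sequence[Sequence[float]], q: Sequence[float]) -> float:
    """Mean of D_KL(p^(l+j) || q) over the m + 1 served layers, for one token."""
    return sum(kl_divergence(p, q) for p in targets) / len(targets)


def averaged_target(targets: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of the served layers' attention distributions."""
    n = len(targets)
    return [sum(column) / n for column in zip(*targets)]


# Two served layers (m = 1) with different attention distributions.
targets = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
p_bar = averaged_target(targets)

# The gap between the two objectives is the same for any indexer output q,
# so both objectives share the same gradient with respect to q.
for q in ([0.5, 0.3, 0.2], [0.2, 0.2, 0.6]):
    gap = multi_layer_loss(targets, q) - kl_divergence(p_bar, q)
    print(round(gap, 6))
```

Both printed values agree, which is the toy counterpart of the statement that the indexer is effectively distilled against the averaged attention distribution.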

Experimental evaluations on a 30B-parameter DSA model quantify the efficiency gained by removing indexer computations. The method eliminates up to 75% of indexer computations with negligible quality degradation, and the reported speedups over standard DSA reach 1.82x for prefill and 1.48x for decode.

The results confirm that IndexCache delivers significant speedups in both prefill and decode phases without degrading model capabilities.

Experiment

End-to-end inference experiments demonstrate that IndexCache significantly reduces prefill latency and raises decode throughput in long-context scenarios, with speedups increasing as context length grows, while maintaining comparable performance on general reasoning tasks.
Training-free IndexCache evaluations reveal that greedy-searched sharing patterns are essential for preserving long-context accuracy at aggressive retention ratios, whereas uniform interleaving causes substantial degradation; however, general reasoning capabilities remain robust across most configurations.
Training-aware IndexCache results show that retraining the model to adapt to index sharing eliminates the sensitivity to specific patterns, allowing simple uniform interleaving to match full-indexer performance and confirming the effectiveness of cross-layer distillation.
Scaling experiments on a 744B-parameter model validate that the trends observed in smaller models hold true, with searched patterns providing stable quality recovery even at high sparsity levels.
Analysis of cross-layer index overlap confirms high redundancy between adjacent layers but reveals that local similarity metrics fail to identify optimal sharing patterns, necessitating end-to-end loss-based search to prevent cascading errors in deep networks.
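A simple way to look at cross-layer index overlap is to compare the top-k sets of adjacent layers. The metric below, intersection size divided by the larger set size, is an assumption chosen for illustration and not necessarily the definition used in the paper; the function names are hypothetical. As the last finding notes, a local measure like this is only diagnostic and does not replace the end-to-end loss-based search.

```python
from typing import List, Sequence


def topk_overlap(a: Sequence[int], b: Sequence[int]) -> float:
    """Share of selected token positions that two top-k sets have in common."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        raise ValueError("both index sets must be non-empty")
    return len(set_a & set_b) / max(len(set_a), len(set_b))


def adjacent_overlaps(per_layer: Sequence[Sequence[int]]) -> List[float]:
    """Overlap between each layer's top-k set and the next layer's."""
    return [topk_overlap(x, y) for x, y in zip(per_layer, per_layer[1:])]


# Toy top-k sets (k = 4) for three consecutive layers.
layers = [[1, 2, 3, 4], [2, 3, 4, 5], [2, 3, 4, 5]]
print(adjacent_overlaps(layers))  # [0.75, 1.0]
```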


Yushi Bai Qian Dong Ting Jiang Xin Lv Zhengxiao Du Aohan Zeng Jie Tang Juanzi Li


IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
