HyperAI

Attention Residuals

Abstract

Residual connections with PreNorm are standard in modern large language models (LLMs), but they aggregate the outputs of all layers with fixed, uniform weights. This uniform accumulation leads to unbounded hidden-state growth with increasing depth, progressively diluting each layer's contribution. In this work, we propose Attention Residuals (AttnRes), which replace this fixed accumulation with a softmax attention mechanism applied over preceding layer outputs, enabling each layer to selectively aggregate earlier representations using learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs when training large-scale models, we propose Block AttnRes, which partitions layers into blocks and applies attention over block-level representations, reducing the memory footprint while preserving the vast majority of full AttnRes's gains. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with negligible overhead. Scaling-law experiments confirm that the improvement is consistent across model sizes, while ablations demonstrate the benefit of content-dependent depth-wise selection. Furthermore, we integrated AttnRes into the Kimi Linear architecture (48B total parameters, 3B active) and pre-trained it on 1.4T tokens, where AttnRes mitigated the dilution effect of PreNorm, producing more uniform output magnitudes and gradient distributions across depth while improving downstream performance on all evaluated tasks.

One-sentence Summary

The Kimi Team proposes Attention Residuals, a novel mechanism replacing fixed residual weights with learned softmax attention to mitigate hidden-state dilution in large language models. Their optimized Block AttnRes variant reduces memory overhead while significantly improving training stability and downstream task performance across various model scales.

Key Contributions

  • The paper introduces Attention Residuals (AttnRes), a mechanism that replaces fixed unit-weight accumulation with learned softmax attention over preceding layer outputs to enable selective, content-dependent aggregation of representations across depth.
  • To address scalability, the work presents Block AttnRes, which partitions layers into blocks and attends over block-level summaries to reduce memory and communication complexity from $O(Ld)$ to $O(Nd)$ while preserving performance gains.
  • Comprehensive experiments on a 48B-parameter model pre-trained on 1.4T tokens demonstrate that the method mitigates hidden-state dilution, yields more uniform gradient distributions, and consistently improves downstream task performance compared to standard residual connections.

Introduction

The research addresses the challenge of efficiently scaling large language models while maintaining high performance, a critical need for deploying advanced AI in real-world applications. Prior approaches often struggle with the computational overhead of attention mechanisms or fail to fully utilize residual connections for stable training at scale. To overcome these limitations, the authors introduce a novel framework centered on attention residuals that optimizes information flow and reduces training costs without sacrificing model quality.

Method

The authors propose Attention Residuals (AttnRes) to address the limitations of standard residual connections in deep networks. In standard architectures, the hidden state update follows a fixed recurrence $h_l = h_{l-1} + f_{l-1}(h_{l-1})$, which unrolls to a uniform sum of all preceding layer outputs. This fixed aggregation causes hidden-state magnitudes to grow linearly with depth, diluting the contribution of individual layers. AttnRes replaces this fixed accumulation with a learned, input-dependent attention mechanism over depth.
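The dilution effect is easy to see with a toy example. The sketch below (illustrative, not the paper's implementation) unrolls the standard recurrence to show that the hidden state is a uniform sum of all layer outputs, so layer $l$'s relative contribution shrinks as roughly $1/l$:

```python
def standard_residual(embedding, layer_fns):
    """Unrolled view of h_l = h_{l-1} + f_{l-1}(h_{l-1}):
    the final state equals the embedding plus a unit-weight sum
    of every preceding layer output."""
    h = embedding
    outputs = []
    for f in layer_fns:
        out = f(h)
        outputs.append(out)
        h = h + out  # fixed unit weight for every layer output
    # h equals embedding plus the uniform sum of all collected outputs
    assert abs(h - embedding - sum(outputs)) < 1e-9
    return h, outputs

# Toy scalar "layers": each contributes a unit-magnitude output, so the
# hidden state grows linearly with depth (the dilution effect).
h, outs = standard_residual(0.0, [lambda x: 1.0] * 32)
print(h)  # 32.0 — by layer 32, each new output is only 1/32 of the state
```

With 32 identical layers, the state magnitude is 32x a single layer's output, which is exactly the growth AttnRes is designed to bound.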

Refer to the framework diagram for a visual comparison of the residual connection variants.

As shown in the figure, the standard approach (a) simply adds the previous layer output. In contrast, Full AttnRes (b) allows each layer to selectively aggregate all previous layer outputs via learned attention weights. Specifically, the input to layer $l$ is computed as $h_l = \sum_{i=0}^{l-1} \alpha_{i \to l} \cdot v_i$, where $v_i$ represents the output of layer $i$ (or the embedding for $i=0$). The attention weights $\alpha_{i \to l}$ are derived from a softmax over a kernel function $\phi(q_l, k_i)$, where $q_l = w_l$ is a learnable pseudo-query vector specific to layer $l$, and $k_i = v_i$ are the keys derived from previous outputs. This mechanism enables content-aware retrieval across depth with minimal parameter overhead.
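A minimal sketch of this depth-wise attention, assuming a simple dot-product kernel $\phi(q_l, k_i) = q_l \cdot k_i$ with $k_i = v_i$ (the paper leaves the kernel choice general; the names `pseudo_query` and `layer_outputs` are illustrative):

```python
import numpy as np

def attn_res_input(layer_outputs, pseudo_query):
    """Compute h_l = sum_i alpha_{i->l} * v_i via a softmax over depth.

    layer_outputs: list of v_i vectors, each shape (d,); v_0 is the embedding
    pseudo_query:  learnable w_l for the current layer, shape (d,)
    """
    V = np.stack(layer_outputs)             # (l, d)
    scores = V @ pseudo_query               # phi(q_l, k_i) with k_i = v_i
    scores -= scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax over depth
    return alpha @ V                        # weighted sum of previous outputs

rng = np.random.default_rng(0)
vs = [rng.standard_normal(8) for _ in range(5)]    # outputs of layers 0..4
h5 = attn_res_input(vs, rng.standard_normal(8))    # input to layer 5
print(h5.shape)  # (8,)
```

Because the softmax weights sum to one, the aggregated state stays on the scale of a single layer output regardless of depth, in contrast to the linear growth of unit-weight accumulation.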

To make this approach scalable for large models, the authors introduce Block AttnRes (c). This variant partitions the $L$ layers into $N$ blocks. Within each block, layer outputs are reduced to a single representation via summation, and attention is applied only over the $N$ block-level representations. This reduces the memory and communication complexity from $O(Ld)$ to $O(Nd)$. The intra-block accumulation is defined as $b_n = \sum_{j \in \mathcal{B}_n} f_j(h_j)$, where $\mathcal{B}_n$ is the set of layers in block $n$. Inter-block attention then operates on these block summaries, allowing layers within a block to attend to previous blocks and the partial sum of the current block.
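The block-level variant can be sketched as follows (an illustrative simplification: it attends over completed block summaries only, with a dot-product kernel; the block size and function names are assumptions, not from the paper):

```python
import numpy as np

def block_attn_res(layer_outputs, block_size, pseudo_query):
    """Attend over N block summaries b_n = sum_{j in B_n} f_j(h_j)
    instead of all L individual layer outputs."""
    blocks = [
        np.sum(layer_outputs[i:i + block_size], axis=0)  # intra-block sum
        for i in range(0, len(layer_outputs), block_size)
    ]
    B = np.stack(blocks)                  # (N, d): O(Nd) memory, not O(Ld)
    scores = B @ pseudo_query             # kernel over block summaries
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                  # softmax over the N blocks
    return alpha @ B

rng = np.random.default_rng(1)
vs = np.stack([rng.standard_normal(8) for _ in range(12)])  # L = 12 layers
h = block_attn_res(vs, block_size=4, pseudo_query=rng.standard_normal(8))
print(h.shape)  # (8,) — attention over N = 3 blocks instead of 12 layers
```

The key trade-off: attention can no longer reweight individual layers inside a block, but the stored state shrinks from $L$ vectors to $N$, which is what makes the method practical under pipeline parallelism.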

For efficient training at scale, the method incorporates specific infrastructure optimizations to handle the communication of block representations across pipeline stages. As illustrated in the pipeline communication example, cross-stage caching is employed to eliminate redundant data transfers.

In this setup, blocks received during earlier virtual stages are cached locally. Consequently, stage transitions only transmit incremental blocks accumulated since the receiver's corresponding chunk in the previous virtual stage, rather than the full history. This caching strategy reduces the peak per-transition communication cost significantly, enabling the method to function as a practical drop-in replacement for standard residual connections with marginal training overhead. Additionally, a two-phase computation strategy is used during inference to amortize cross-block attention costs, further minimizing latency.
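The caching logic above can be sketched in a few lines (a hedged illustration of the incremental-transfer idea only; the function and variable names are hypothetical, and real pipeline stages would ship tensors, not strings):

```python
def incremental_send(all_blocks, receiver_cache_len):
    """Return only the block summaries the receiver has not yet cached,
    instead of retransmitting the full history at every stage transition."""
    return all_blocks[receiver_cache_len:]

cache = []  # receiver-side cache of block summaries from earlier stages
for produced in (["b0", "b1"], ["b0", "b1", "b2"], ["b0", "b1", "b2", "b3"]):
    delta = incremental_send(produced, len(cache))
    cache.extend(delta)  # receiver reconstructs the full history locally
    print(delta)         # each transition ships only the new blocks
assert cache == ["b0", "b1", "b2", "b3"]
```

Each transition transmits only the blocks accumulated since the receiver's last cached point, which is why the peak per-transition cost stays small even as the block history grows.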

Experiment

  • Scaling law experiments validate that both Full and Block Attention Residuals (AttnRes) consistently outperform the PreNorm baseline across all model sizes, with Block AttnRes recovering most of the performance gains of the full variant while maintaining lower memory overhead.
  • Main training results demonstrate that AttnRes resolves the PreNorm dilution problem by bounding hidden-state growth and achieving a more uniform gradient distribution, leading to superior performance on multi-step reasoning, code generation, and knowledge benchmarks.
  • Ablation studies confirm that input-dependent weighting via softmax attention is critical for performance, while block-wise aggregation offers an effective trade-off between memory efficiency and the ability to access distant layers compared to sliding-window or full cross-layer access.
  • Architecture sweeps reveal that AttnRes shifts the optimal design preference toward deeper and narrower networks compared to standard Transformers, indicating that the method more effectively leverages increased depth for information flow.
  • Analysis of learned attention patterns shows that AttnRes preserves locality while establishing learned skip connections to early layers and the token embedding, with Block AttnRes successfully maintaining these structural benefits through implicit regularization.
