Attention Residuals
Abstract
Residual connections with PreNorm are the norm in modern LLMs, but they accumulate the outputs of all layers with fixed unit weights. This uniform aggregation causes unbounded growth of the hidden states with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replace this fixed accumulation with softmax attention over the outputs of preceding layers, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs when training large-scale models, we introduce Block AttnRes. This approach partitions layers into blocks and applies attention to block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead. Scaling-law experiments confirm that the improvement is consistent across model sizes, while ablation studies validate the benefit of content-dependent selection over depth. We then integrated AttnRes into the Kimi Linear architecture (48B total parameters, 3B activated) and pre-trained it on 1.4 trillion tokens. In this setting, AttnRes mitigates PreNorm-induced dilution, yielding more uniform output magnitudes and gradient distribution across depth, while improving downstream performance on all evaluated tasks.
One-sentence Summary
The Kimi Team proposes Attention Residuals, a novel mechanism replacing fixed residual weights with learned softmax attention to mitigate hidden-state dilution in large language models. Their optimized Block AttnRes variant reduces memory overhead while significantly improving training stability and downstream task performance across various model scales.
Key Contributions
- The paper introduces Attention Residuals (AttnRes), a mechanism that replaces fixed unit-weight accumulation with learned softmax attention over preceding layer outputs to enable selective, content-dependent aggregation of representations across depth.
- To address scalability, the work presents Block AttnRes, which partitions layers into blocks and attends over block-level summaries to reduce memory and communication complexity from O(Ld) to O(Nd) while preserving performance gains.
- Comprehensive experiments on a 48B-parameter model pre-trained on 1.4T tokens demonstrate that the method mitigates hidden-state dilution, yields more uniform gradient distributions, and consistently improves downstream task performance compared to standard residual connections.
Introduction
The research addresses the challenge of efficiently scaling large language models while maintaining high performance, a critical need for deploying advanced AI in real-world applications. Prior approaches often struggle with the computational overhead of attention mechanisms or fail to fully utilize residual connections for stable training at scale. To overcome these limitations, the authors introduce a novel framework centered on attention residuals that optimizes information flow and reduces training costs without sacrificing model quality.
Method
The authors propose Attention Residuals (AttnRes) to address the limitations of standard residual connections in deep networks. In standard architectures, the hidden state update follows a fixed recurrence h_l = h_{l-1} + f_{l-1}(h_{l-1}), which unrolls to a uniform sum of all preceding layer outputs. This fixed aggregation causes hidden-state magnitudes to grow with depth, diluting the contribution of individual layers. AttnRes replaces this fixed accumulation with a learned, input-dependent attention mechanism over depth.
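The dilution effect described above can be seen in a toy simulation. This is a minimal sketch, not the paper's architecture: `f` is a stand-in for an arbitrary residual branch, and the dimensions and depth are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 64, 24  # hidden size and depth (illustrative values, not from the paper)

def f(h, W):
    """Stand-in for a layer's residual branch (the paper's f_l)."""
    return np.tanh(h @ W)

h = rng.standard_normal(d) / np.sqrt(d)  # h_0: the token embedding
norms = []
for l in range(L):
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    h = h + f(h, W)  # fixed recurrence: h_l = h_{l-1} + f_{l-1}(h_{l-1})
    norms.append(np.linalg.norm(h))

# Unrolled, h_L = h_0 + sum_l f_l(h_l): every layer contributes one
# unit-weight term, so the norm of the hidden state grows with depth
# and any single layer's output becomes a shrinking fraction of it.
print(norms[0], norms[-1])
```

Each layer's output enters the sum with the same fixed weight of 1, which is exactly the uniform aggregation AttnRes replaces.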
Refer to the framework diagram for a visual comparison of the residual connection variants.

As shown in the figure, the standard approach (a) simply adds the previous layer output. In contrast, Full AttnRes (b) allows each layer to selectively aggregate all previous layer outputs via learned attention weights. Specifically, the input to layer l is computed as h_l = Σ_{i=0}^{l-1} α_{i→l} · v_i, where v_i represents the output of layer i (or the embedding for i = 0). The attention weights α_{i→l} are derived from a softmax over a kernel function φ(q_l, k_i), where q_l = w_l is a learnable pseudo-query vector specific to layer l, and k_i = v_i are the keys derived from previous outputs. This mechanism enables content-aware retrieval across depth with minimal parameter overhead.
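A minimal sketch of this depth attention for a single layer follows. The scaled dot product used for the kernel φ is an assumption (the summary does not specify the kernel), and all shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, l = 32, 6  # hidden size and current layer index (illustrative)

# v_0 .. v_{l-1}: the embedding plus the outputs of the preceding layers.
V = rng.standard_normal((l, d))
K = V                        # keys k_i = v_i: the previous outputs themselves
q = rng.standard_normal(d)   # w_l: learnable pseudo-query for layer l

def phi(q, K):
    """Kernel phi(q_l, k_i); a scaled dot product as a stand-in."""
    return K @ q / np.sqrt(d)

scores = phi(q, K)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()         # softmax over depth: alpha_{i->l}

h_l = alpha @ V              # h_l = sum_i alpha_{i->l} * v_i
print(h_l.shape)
```

The only new parameter per layer is the pseudo-query w_l (d values), which is why the parameter overhead is minimal; the keys and values reuse the layer outputs already produced.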
To make this approach scalable for large models, the authors introduce Block AttnRes (c). This variant partitions the L layers into N blocks. Within each block, layer outputs are reduced to a single representation via summation, and attention is applied only over the N block-level representations. This reduces the memory and communication complexity from O(Ld) to O(Nd). The intra-block accumulation is defined as b_n = Σ_{j∈B_n} f_j(h_j), where B_n is the set of layers in block n. Inter-block attention then operates on these block summaries, allowing layers within a block to attend to previous blocks and the partial sum of the current block.
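The block-level reduction can be sketched as follows. For simplicity this toy attends only over completed block summaries (the partial sum of the current block, which the paper also includes, is omitted), and the per-layer branch outputs are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
d, L, N = 32, 12, 3          # illustrative: 12 layers in 3 blocks of 4
block = L // N

# Placeholder per-layer residual-branch outputs f_j(h_j).
F = rng.standard_normal((L, d))

# Intra-block accumulation: b_n = sum over j in B_n of f_j(h_j).
B = F.reshape(N, block, d).sum(axis=1)   # (N, d) block summaries

# A layer then attends over the N block summaries instead of all
# L layer outputs: O(N d) state to keep and communicate, not O(L d).
w = rng.standard_normal(d)               # pseudo-query for that layer
scores = B @ w / np.sqrt(d)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()
h_in = alpha @ B
print(B.shape, h_in.shape)
```

With N fixed (e.g. one block per pipeline stage), the state attended over no longer grows with depth, which is what makes the variant practical for large L.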
For efficient training at scale, the method incorporates specific infrastructure optimizations to handle the communication of block representations across pipeline stages. As illustrated in the pipeline communication example, cross-stage caching is employed to eliminate redundant data transfers.

In this setup, blocks received during earlier virtual stages are cached locally. Consequently, stage transitions only transmit incremental blocks accumulated since the receiver's corresponding chunk in the previous virtual stage, rather than the full history. This caching strategy reduces the peak per-transition communication cost significantly, enabling the method to function as a practical drop-in replacement for standard residual connections with marginal training overhead. Additionally, a two-phase computation strategy is used during inference to amortize cross-block attention costs, further minimizing latency.
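A toy accounting of the caching idea illustrates the saving. The per-stage block counts are assumptions for illustration; the point is only that incremental transfers replace retransmission of the full prefix.

```python
# Naive scheme: every stage transition retransmits all block summaries
# produced so far. Cached scheme: the receiver keeps earlier blocks
# locally, so each transition sends only the newly produced ones.
N = 8                        # total number of blocks (illustrative)
stages = [2, 2, 2, 2]        # blocks produced per virtual stage (assumption)

naive = []                   # blocks sent per transition, full history
cached = []                  # blocks sent per transition, incremental
seen = 0
for produced in stages:
    seen += produced
    naive.append(seen)       # full prefix of block summaries so far
    cached.append(produced)  # only the blocks new since the last visit

print(sum(naive), sum(cached))  # total block transfers across transitions
```

Under caching, each block summary crosses a given transition at most once, so the total traffic is bounded by N rather than growing quadratically in the number of virtual stages.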
Experiment
- Scaling law experiments validate that both Full and Block Attention Residuals (AttnRes) consistently outperform the PreNorm baseline across all model sizes, with Block AttnRes recovering most of the performance gains of the full variant while maintaining lower memory overhead.
- Main training results demonstrate that AttnRes resolves the PreNorm dilution problem by bounding hidden-state growth and achieving a more uniform gradient distribution, leading to superior performance on multi-step reasoning, code generation, and knowledge benchmarks.
- Ablation studies confirm that input-dependent weighting via softmax attention is critical for performance, while block-wise aggregation offers an effective trade-off between memory efficiency and the ability to access distant layers compared to sliding-window or full cross-layer access.
- Architecture sweeps reveal that AttnRes shifts the optimal design preference toward deeper and narrower networks compared to standard Transformers, indicating that the method more effectively leverages increased depth for information flow.
- Analysis of learned attention patterns shows that AttnRes preserves locality while establishing learned skip connections to early layers and the token embedding, with Block AttnRes successfully maintaining these structural benefits through implicit regularization.