Mixture-of-Depths Attention
Abstract
Scaling network depth is a central driver of progress in large language models (LLMs). As depth increases, however, LLMs often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them difficult to recover in deeper layers. We introduce Mixture-of-Depths Attention (MoDA), a mechanism that allows each attention head to attend to both the current layer's KV pairs and KV pairs from preceding layers. We further present a hardware-efficient algorithm for MoDA that resolves non-contiguous memory access patterns and achieves 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models show that MoDA consistently outperforms strong baselines. In particular, it improves average perplexity by 0.2 across 10 validation benchmarks and raises average performance by 2.11% on 10 downstream tasks, at a negligible compute overhead of only 3.7% in FLOPs. We also find that combining MoDA with post-norm performs better than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is available at https://github.com/hustvl/MoDA.
One-sentence Summary
Researchers from Huazhong University of Science & Technology and ByteDance propose Mixture-of-Depths Attention (MoDA), a hardware-efficient mechanism that mitigates information dilution in deep large language models by enabling dynamic cross-layer retrieval, thereby improving downstream performance with minimal computational overhead.
Key Contributions
- The paper introduces mixture-of-depths attention (MoDA), a unified mechanism that enables each attention head to dynamically attend to both current sequence key-value pairs and depth key-value pairs from preceding layers to mitigate information dilution.
- A hardware-efficient fused algorithm is presented that resolves non-contiguous memory access patterns through chunk-aware layouts and group-aware indexing, achieving 97.3% of FlashAttention-2 efficiency at a 64K sequence length.
- Extensive experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms the OLMo2 baseline, improving average perplexity by 0.2 across 10 validation benchmarks and increasing downstream task performance by 2.11% with only 3.7% additional FLOPs.
Introduction
Scaling the depth of large language models is critical for enhancing representational capacity, yet deeper networks often suffer from information dilution, where valuable features from shallow layers degrade as they pass through repeated residual updates. Prior attempts to address this with dense cross-layer connections preserve history but incur prohibitive growth in parameters and compute, while standard residual pathways fail to prevent signal loss. The authors introduce Mixture-of-Depths Attention (MoDA), a mechanism that allows each attention head to dynamically retrieve key-value pairs from both the current sequence and preceding layers to recover lost information. They further develop a hardware-efficient fused algorithm that resolves non-contiguous memory access patterns, achieving 97.3% of FlashAttention-2 efficiency at a 64K sequence length while delivering consistent performance gains at negligible computational cost.
Method
The authors propose a novel framework for stacking Transformer blocks along the depth stream, conceptualizing each block as a three-step procedure: read, operate, and write. This perspective allows for a systematic exploration of mechanisms that propagate information through depth, moving beyond standard residual connections.
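The read/operate/write view can be sketched in a few lines. The following minimal NumPy toy (function and variable names are ours, not the paper's) casts a standard residual block into the three steps:

```python
import numpy as np

def residual_block(x, operate):
    """One Transformer block viewed as read -> operate -> write.

    read:    take the current depth-stream state x (identity read)
    operate: any token/channel-mixing function (attention, FFN, ...)
    write:   add the result back onto the stream (additive write)
    """
    read_out = x                     # read step: identity
    update = operate(read_out)       # operate step
    return x + update                # write step: residual superposition

# toy depth stream: four stacked blocks with a fixed linear "operate"
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((8, 8))
x = rng.standard_normal((3, 8))          # (tokens, hidden)
for _ in range(4):
    x = residual_block(x, lambda h: h @ W)
```

In this framing, the mechanisms below differ only in how the read and write steps are defined.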
Depth-Stream Mechanisms
The authors first establish a design space by comparing existing and intermediate mechanisms. Standard Depth Residual connections use an identity read and an additive write, which can lead to signal degradation due to repeated superposition. Depth Dense methods mitigate this by concatenating historical representations, though this incurs high computational costs. To balance efficiency and adaptivity, the authors introduce Depth Attention, which reads historical depth information using attention in a data-dependent manner.
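The trade-off between the two baseline reads is visible directly in their tensor shapes. A toy NumPy contrast (our own illustration, shapes only):

```python
import numpy as np

# Depth Residual superposes history by summation, so the width stays fixed
# but individual layer signals are progressively diluted. Depth Dense
# concatenates history, preserving it exactly at a cost that grows with
# the layer index l.
rng = np.random.default_rng(0)
history = [rng.standard_normal((4, 8)) for _ in range(5)]  # 5 layer states

residual_read = sum(history)                   # (4, 8):  fixed width
dense_read = np.concatenate(history, axis=-1)  # (4, 40): width grows with depth
```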
The progression of these mechanisms is illustrated in the conceptual comparison of depth-stream designs.

In the Depth Attention formulation, the input to layer l is computed by attending to the historical key-value pairs {(K_i, V_i)} for i = 0, …, l−1, drawn from the same token position across layers. The output is then projected to new query, key, and value pairs for the next layer.
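A minimal NumPy sketch of this per-token depth read, under our own simplifying assumptions (single head, shared hidden size, no output projection):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def depth_attention(q, depth_K, depth_V):
    """Per-token attention over the depth stream.

    q       -- (T, d)     queries at layer l
    depth_K -- (l, T, d)  keys from layers 0..l-1, same token positions
    depth_V -- (l, T, d)  values from layers 0..l-1
    Each token attends only to its own column of the depth history.
    """
    # (T, l): token t scored against its own history at every layer
    scores = np.einsum("td,ltd->tl", q, depth_K) / np.sqrt(q.shape[-1])
    return np.einsum("tl,ltd->td", softmax(scores), depth_V)

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
dK, dV = rng.standard_normal((3, 4, 8)), rng.standard_normal((3, 4, 8))
out = depth_attention(q, dK, dV)    # (4, 8)
```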
Mixture-of-Depths Attention (MoDA)
Building upon Depth Attention, the authors propose Mixture-of-Depths Attention (MoDA), which unifies sequence-level and depth-level retrieval into a single softmax operator. MoDA reads the current hidden state X_{l−1} and the historical depth key-value stream {(K_i, V_i)} for i = 0, …, l−1. During the operate step, each token attends to both the sequence-level keys and values and its own historical depth-wise keys and values, with all attention scores normalized jointly.
The architecture of the Transformer Decoder incorporating MoDA and the resulting visible relationships of the attention mechanism are shown below.

At the write step, the current layer's key-value pair is appended to the depth stream for subsequent layers. For the Feed-Forward Network (FFN) layer, a lightweight KV projection is used to generate the corresponding key-value pair. This design allows MoDA to exploit depth history efficiently with substantially lower overhead than dense cross-layer connectivity.
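The joint normalization can be sketched as follows. This is an illustrative single-head NumPy version, not the authors' fused kernel: sequence scores are causally masked, depth scores are per-token, and the two blocks share one softmax.

```python
import numpy as np

def moda_attention(q, K_seq, V_seq, depth_K, depth_V):
    """Single softmax over sequence-level and depth-level KV pairs.

    q, K_seq, V_seq  -- (T, d)     current-layer sequence tensors
    depth_K, depth_V -- (l, T, d)  per-token KV stream from layers 0..l-1
    """
    T, d = q.shape
    s_seq = q @ K_seq.T / np.sqrt(d)                           # (T, T)
    s_seq[np.triu_indices(T, k=1)] = -np.inf                   # causal mask
    s_dep = np.einsum("td,ltd->tl", q, depth_K) / np.sqrt(d)   # (T, l)
    s = np.concatenate([s_seq, s_dep], axis=-1)                # joint logits
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)                         # one softmax
    p_seq, p_dep = p[:, :T], p[:, T:]
    return p_seq @ V_seq + np.einsum("tl,ltd->td", p_dep, depth_V)

rng = np.random.default_rng(0)
T, d, l = 5, 8, 3
q, K_seq, V_seq = (rng.standard_normal((T, d)) for _ in range(3))
dK, dV = rng.standard_normal((l, T, d)), rng.standard_normal((l, T, d))
out = moda_attention(q, K_seq, V_seq, dK, dV)   # (T, d)
```

With an empty depth stream this reduces to standard causal attention, which matches the design intent: depth retrieval is an additive capability, not a replacement for sequence attention.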
Hardware-Efficient Implementation
To address the memory and bandwidth bottlenecks associated with caching all depth key-value states, the authors develop a hardware-aware implementation. A naive implementation would require non-contiguous reads, degrading GPU utilization. The proposed solution reorganizes depth-stream tensors to enable contiguous memory access and fused computation.
The hardware view of MoDA depth-cache access demonstrates two key layout strategies: Flash-Compatible and Chunk/Group-Aware.

The Flash-Compatible layout flattens the depth cache along a single axis of length T×L, allowing each query to map to a contiguous depth range. However, to further improve depth utilization, the Chunk/Group-Aware layout groups queries by chunk size C. This reduces the effective depth span from T×L to (C×L)/G per chunk, where G is the Group Query Attention (GQA) group number. This reorganization minimizes unnecessary HBM traffic from masked, out-of-range depth entries and aligns query-block boundaries with G to simplify vectorized execution. The implementation follows an online softmax update process, accumulating logits from sequence and depth blocks into a single on-chip state before normalizing.
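The online-softmax accumulation described above can be sketched in NumPy. This toy (our own illustration, not the fused GPU kernel) folds a sequence block and a depth block into one running on-chip state, then checks the result against a single joint softmax over both blocks:

```python
import numpy as np

def online_softmax_merge(m, l, acc, scores_blk, V_blk):
    """Fold one block of attention logits into a running softmax state.

    m   -- (T,)    running row-wise max of logits seen so far
    l   -- (T,)    running sum of exp(logit - m)
    acc -- (T, dv) running exp-weighted value accumulator
    """
    m_new = np.maximum(m, scores_blk.max(axis=-1))
    scale = np.exp(m - m_new)                      # rescale the old state
    p = np.exp(scores_blk - m_new[:, None])
    return m_new, l * scale + p.sum(axis=-1), acc * scale[:, None] + p @ V_blk

rng = np.random.default_rng(1)
q = rng.standard_normal((2, 4))
K_seq, V_seq = rng.standard_normal((6, 4)), rng.standard_normal((6, 3))
K_dep, V_dep = rng.standard_normal((5, 4)), rng.standard_normal((5, 3))

# accumulate the sequence block, then the depth block, into one state
m, l, acc = np.full(2, -np.inf), np.zeros(2), np.zeros((2, 3))
for K, V in [(K_seq, V_seq), (K_dep, V_dep)]:
    m, l, acc = online_softmax_merge(m, l, acc, q @ K.T, V)
out_online = acc / l[:, None]    # normalize once at the end

# reference: one joint softmax over the concatenated blocks
s = q @ np.concatenate([K_seq, K_dep]).T
p = np.exp(s - s.max(axis=-1, keepdims=True))
out_ref = (p / p.sum(axis=-1, keepdims=True)) @ np.concatenate([V_seq, V_dep])
```

Because the running state is rescaled whenever a new block raises the row maximum, the blocks can arrive in any order, which is what lets the kernel interleave sequence and depth blocks freely.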
Experiment
- Efficiency comparisons demonstrate that the proposed MoDA implementation scales predictably, with overhead becoming negligible as sequence length increases or depth utilization rises, while maintaining linear scaling behavior in long-sequence regimes.
- Variant analysis on large language models reveals that injecting depth key-value information significantly improves performance with minimal computational cost, and adding depth projections specifically for FFN layers yields the best accuracy-efficiency trade-off compared to reusing attention projections.
- Scaling experiments confirm that MoDA delivers stable performance gains across different model sizes (700M to 1.5B) and diverse downstream tasks, including commonsense reasoning and broad knowledge benchmarks, while consistently lowering validation perplexity across multiple data domains.
- Layer-number studies indicate that MoDA remains effective in both shallower and deeper model configurations, with depth key-value injection consistently reducing validation loss and providing greater benefits in post-norm settings for deeper stacks.
- Attention visualizations show that the model actively retrieves cross-layer depth information rather than relying solely on sequence context, redistributing attention mass away from typical attention sinks toward more task-relevant sequence and depth locations.
- Kernel implementation ablations validate that combining flash-compatible layouts, chunk-aware designs, and group-aware indexing achieves massive speedups over naive baselines, reducing runtime by over three orders of magnitude.