il y a 13 heures

Zhentao Tan Wei Chen Jingyi Shen Yao Liu Xu Shen Yue Wu Jieping Ye

Table des matières

Résumé

La complexité quadratique de l’attention constitue un goulot d’étranglement critique pour le traitement de contextes longs, stimulant l’intérêt pour des architectures d’attention hybrides. La plupart des modèles hybrides open source adoptent une stratégie au niveau des couches (layer-wise). Pourtant, des travaux antérieurs ont souligné la difficulté intrinsèque à intégrer l’Attention Linéaire (Linear Attention, LA) avec l’Attention Complète (Full Attention, FA), suggérant que l’espace de conception de l’hybridation des attentions reste peu exploré. Afin d’investiguer cet espace, nous menons une analyse d’interprétabilité et observons que les couches présentent une similarité fonctionnelle par blocs, tandis que les têtes (heads) individuelles au sein d’une même couche affichent une spécialisation fonctionnelle distincte, bien qu’elles partagent les mêmes caractéristiques d’entrée (input features). Cette hétérogénéité au niveau des têtes suggère que la dimension des têtes offre une granularité naturelle et principiellement fondée pour fusionner des signaux d’attention hétérogènes. Forts de cette intuition, nous présentons HydraHead, une nouvelle architecture qui hybride FA et LA le long de l’axe des têtes (head axis). HydraHead se distingue par deux innovations majeures : (1) une stratégie de sélection pilotée par l’interprétabilité qui identifie les têtes critiques pour la récupération d’information et conserve l’FA uniquement pour celles-ci, et (2) un module de fusion normalisé par échelle (scale-normalized) qui résout l’écart distributionnel entre les sorties des têtes FA et LA. En exploitant un pipeline de transfert en trois étapes basé sur la réutilisation des paramètres et la distillation, nous obtenons des modèles hybrides performants avec un coût d’entraînement minimal. Dans un cadre d’entraînement unifié, HydraHead surpasse les autres conceptions hybrides sur les tâches à contextes longs, tout en maintenant de solides capacités de raisonnement général. Grâce à sa sélection de têtes pilotée par l’interprétabilité, il égale les performances en contexte long d’un hybridation layer-wise avec un ratio de 3:1, mais avec un ratio LA/FA de 7:1. De manière cruciale, entraîné sur seulement 15 milliards de tokens, HydraHead affiche une amélioration de plus de 69 % par rapport à la ligne de base à une longueur de contexte de 512K, se rapprochant ainsi de Qwen3.5, un modèle phare de taille comparable doté d’une longueur de contexte native de 256K. Cela met en évidence le potentiel de mise à l’échelle (scaling) significatif offert par l’hybridation au niveau des têtes.

One-sentence Summary

HydraHead, a novel architecture hybridizing full and linear attention at the head level via interpretability-driven selection of retrieval-critical heads and a scale-normalized fusion module, trained on only 15B tokens, outperforms layer-wise hybrid designs in long-context tasks, achieving over 69% improvement over the baseline at 512K context length and matching a 3:1 layer-wise hybrid’s long-context performance at a 7:1 linear-to-full attention ratio.

Key Contributions

HydraHead is a hybrid attention architecture that mixes full attention and linear attention at the head level, using an interpretability-driven selection strategy to reserve full attention only for retrieval-critical heads.
A scale-normalized fusion module reconciles the distributional mismatch between full and linear attention head outputs, enabling stable integration of their complementary signals.
A three-stage transfer pipeline with parameter reuse and distillation yields high-performance hybrids with minimal training overhead; trained on 15B tokens, HydraHead improves over the baseline by more than 69% at 512K context, approaches Qwen3.5, and matches a 3:1 layer-wise hybrid’s long-context performance with a 7:1 linear-to-full attention ratio.

Introduction

The authors examine the challenge of extending LLM context windows for autonomous reasoning agents, where standard full attention scales quadratically and pure linear attention often suffers expressivity collapse on precise retrieval tasks. Prior hybridization efforts primarily adopt layer-wise designs, but layer outputs vary smoothly, offering weak signal for placing different attention mechanisms, and training hybrids that combine full and linear attention has proven difficult. The key insight is that attention heads within a layer exhibit sharp functional heterogeneity: only a sparse subset drives token-level retrieval, while the rest remain largely inactive. The authors leverage this head-level specialization to propose HydraHead, a fine-grained architecture that uses interpretability-based selection to assign full attention solely to retrieval-critical heads and linear attention to the rest, combined with a head-wise scale-normalized fusion to mitigate interference, enabling aggressive full-attention compression and state-of-the-art long-context performance with little loss in reasoning.

Method

The authors leverage causal intervention techniques to determine which attention heads require Full Attention (FA) and which can be approximated by Linear Attention (LA). This process involves three steps. First, they use activation patching to measure the direct causal effect of individual heads on target behaviors, identifying receiver heads that write critical signals into the residual stream. Second, they employ path patching to trace upstream contributions, identifying sender heads that feed into the receivers. Finally, they fuse these per-head scores across multiple capabilities to create a unified ranking. The top-ranked heads are assigned to the FA branch to preserve precise retrieval capabilities, while the remaining heads are assigned to the LA branch.

To harness the complementary strengths of both mechanisms, the authors introduce a Head-wise Hybrid Attention mechanism. As shown in the figure below:

Instead of applying a uniform attention pattern across all layers or tokens, the model selectively assigns each query head to either the FA branch or the Gated DeltaNet (GDN) branch based on the functional importance estimated previously. This fine-grained hybridization allows the model to retain critical reasoning capabilities in specific heads via FA, while leveraging the efficiency of GDN for context extension.

The internal structural designs of the two branches are specifically refined to support this hybridization. As shown in the figure below:

For the FA branch, the authors remove Rotary Position Embedding (RoPE) and apply a log-scale coefficient to query features to stabilize attention distributions in long contexts. They also introduce an auxiliary gate branch to alleviate the attention sink phenomenon. For the GDN branch, they explicitly integrate RoPE into query and key projections to enhance positional awareness within the receptive field and expand the number of key-value heads to match the query heads, transitioning to a Multi-Head Attention-like configuration.

The authors also contextualize this design against other hybrid strategies. As shown in the figure below:

They compare head-wise selection against layer-wise and token-wise hybrids, demonstrating that head-level granularity provides a natural space for fusing heterogeneous attention signals.

A fundamental challenge in fusing FA and LA outputs is the distributional gap. FA produces sharp, low-entropy distributions modulated by query norm, while LA yields smoother, higher-entropy representations. To reconcile this, the authors propose a scale-normalized fusion module. They apply RMSNorm independently to each head's output $\mathbf{O}_h$ to unify feature scales:

$\hat{\mathbf{O}}_h = \text{Norm}(\mathbf{O}_h)$

These normalized outputs are concatenated along the head dimension according to their original indices to maintain functional identity. Finally, a learnable head-wise scaling vector $\boldsymbol{\gamma}$ is applied to adaptively recalibrate the contribution of each head:

$\tilde{\mathbf{O}}_{:, h:} = \boldsymbol{\gamma}_h \cdot \hat{\mathbf{O}}_{:, h:}$

The modulated tensor is then reshaped and projected to produce the final attention output.

Transforming a pre-trained Transformer into a hybrid architecture requires a principled initialization to avoid optimization instability. The authors adopt a three-stage transfer pipeline.

In the first stage, they focus on parameter migration and layer-wise output alignment. For FA heads, they introduce a gate branch initialized to approximate an identity function. For GDN heads, they reuse the Q, K, V projection weights from the pre-trained FA layers, using channel-wise repetition to handle dimension mismatches for key-value heads. They freeze the pre-trained backbone and optimize the hybrid layers using a Mean Squared Error loss to align the hidden states of the hybrid layers with the original FA layers:

$\mathcal{L}_{align} = \sum_{l=1}^L || \mathbf{H}_{FA}^{(l)}(x) - \mathbf{H}_{Hybrid}^{(l)}(x) ||_2^2$

In the second stage, they perform global logits distillation. The entire model is unfrozen, and the student model's output distribution is aligned with the teacher model using KL divergence loss combined with cross-entropy loss. This ensures global semantic coherence and recovers performance degradation from the linear approximation.

In the final stage, the authors conduct long-context fine-tuning. Using the Next Token Prediction objective with a longer context length, the model consolidates its long-context capabilities, building upon the stable initialization from the previous stages.

Experiment

The HydraHead architecture is evaluated on Qwen3-1.7B against layer-wise, token-wise, and other head-wise hybrids using long-context retrieval (RULER) and general reasoning benchmarks, with ablations examining structural components, fusion designs, and head selection strategies. Interpretability-guided global head selection, which identifies a sparse subset of critical attention heads scattered across layers, coupled with head-wise scale modulation and an optimized three-stage transfer learning pipeline, consistently yields the best balance between long-context extrapolation and reasoning capability. Scaling to over 15B tokens shows that HydraHead surpasses standard Transformers and other hybrid models at extreme context lengths while preserving strong general reasoning, validating that head-level specialization is the right axis for efficient hybridization.

The authors investigate feature fusion strategies for their head-wise hybrid architecture, comparing direct concatenation, head-wise scale modulation, and head-wise gated competition. Results indicate that feature normalization is essential, as removing it leads to significant performance drops in long-context retrieval tasks. Head-wise scale modulation emerges as the most effective fusion method, consistently outperforming gated competition, particularly in extended context scenarios. Removing feature normalization causes severe degradation in long-context retrieval performance. Head-wise scale modulation outperforms gated competition across nearly all evaluation metrics. Scale modulation provides substantial gains in long-context extension scenarios compared to dynamic gating.

The authors compare their proposed head-wise hybrid model against standard Transformers and other hybrid architectures on long-context retrieval benchmarks. Results show that while standard models and competing hybrids suffer severe performance degradation or complete failure at extended context lengths, the proposed method maintains robust retrieval capabilities across both single and multi-key tasks. The proposed model sustains high accuracy on single-key retrieval tasks even at the longest evaluated context lengths, whereas other hybrid models and standard Transformers drop to near-zero performance. On multi-key retrieval tasks, the proposed approach significantly outperforms existing hybrid architectures at extended context lengths, demonstrating superior scalability. Standard Transformer models without specific length extrapolation enhancements lose their retrieval capability beyond their native context window, highlighting the advantage of the head-wise hybrid design.

The authors compare their head-wise hybrid model using a 7:1 ratio against a layer-wise hybrid baseline using a 3:1 ratio. Results show that the head-wise approach achieves broadly comparable long-context retrieval performance despite using a much higher proportion of linear attention. Furthermore, the head-wise model demonstrates a substantial advantage in general reasoning capabilities, highlighting the effectiveness of interpretability-guided head selection in preserving domain knowledge under aggressive sparsity. The head-wise hybrid model with a 7:1 ratio maintains long-context retrieval performance comparable to the layer-wise hybrid with a 3:1 ratio. The head-wise approach significantly outperforms the layer-wise baseline on both hard and easy general reasoning benchmarks. Interpretability-guided head selection allows for aggressive linear attention ratios without severely compromising general-domain capabilities.

The authors conduct an ablation study to evaluate the contribution of various structural components in their head-wise hybrid architecture. Progressive integration of modules such as positional scaling, gating mechanisms, and multi-head attention configurations significantly enhances long-context retrieval capabilities while maintaining general reasoning performance. The final configuration with query decomposition achieves the best balance across all evaluation dimensions. Replacing positional encoding in the full attention branch with scale modulation substantially improves long-context retrieval but initially reduces general reasoning performance. Incorporating multi-head attention in the linear attention branch helps recover the lost general reasoning capabilities. The complete model configuration with query decomposition achieves optimal performance on single-key retrieval tasks while maintaining robust results across other benchmarks.

The authors compare their proposed HydraHead architecture against layer-wise, token-wise, and head-wise hybrid transformers to evaluate performance trade-offs. Results indicate that while layer-wise hybrids favor long-context tasks and token-wise or head-wise hybrids excel at general reasoning, HydraHead achieves the best overall balance. It delivers state-of-the-art performance in long-context retrieval while maintaining robust general reasoning capabilities with a moderate cache footprint. HydraHead achieves the highest performance in both native and extended context single-key retrieval tasks, significantly outperforming other hybrid architectures. Token-wise and head-wise hybrids demonstrate substantially better general reasoning capabilities compared to layer-wise hybrids, though they struggle with long-context extrapolation. The proposed method maintains strong general reasoning performance while using a smaller key-value cache size compared to full head-wise mixing baselines.

The experiments evaluate head-wise hybrid attention architectures for long-context retrieval and general reasoning. A comparison of fusion strategies finds that head-wise scale modulation with feature normalization provides the best scalability, consistently surpassing gated competition at extended context lengths. Benchmarks against standard Transformers and other hybrids confirm that the proposed design maintains robust single- and multi-key retrieval where alternatives fail, while interpretability-guided head selection allows aggressive linear attention ratios without degrading general reasoning. Ablation studies demonstrate that integrating multi-head attention in the linear branch, positional scaling, and query decomposition progressively balances long-context and reasoning performance, and the final HydraHead achieves state-of-the-art retrieval with strong general capabilities and a smaller cache footprint.

PDF source

Table des matières

Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA

GPU prêts à l’emploi

Tarifs les plus avantageux

Commencer Voir les tarifs

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour

Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin

Propulsé par MailChimp

il y a 13 heures

Apprentissage Profond

Zhentao Tan Wei Chen Jingyi Shen Yao Liu Xu Shen Yue Wu Jieping Ye

Table des matières

Résumé

One-sentence Summary

Key Contributions

HydraHead is a hybrid attention architecture that mixes full attention and linear attention at the head level, using an interpretability-driven selection strategy to reserve full attention only for retrieval-critical heads.
A scale-normalized fusion module reconciles the distributional mismatch between full and linear attention head outputs, enabling stable integration of their complementary signals.
A three-stage transfer pipeline with parameter reuse and distillation yields high-performance hybrids with minimal training overhead; trained on 15B tokens, HydraHead improves over the baseline by more than 69% at 512K context, approaches Qwen3.5, and matches a 3:1 layer-wise hybrid’s long-context performance with a 7:1 linear-to-full attention ratio.

Introduction

Method

To harness the complementary strengths of both mechanisms, the authors introduce a Head-wise Hybrid Attention mechanism. As shown in the figure below:

The internal structural designs of the two branches are specifically refined to support this hybridization. As shown in the figure below:

The authors also contextualize this design against other hybrid strategies. As shown in the figure below:

They compare head-wise selection against layer-wise and token-wise hybrids, demonstrating that head-level granularity provides a natural space for fusing heterogeneous attention signals.

$\hat{\mathbf{O}}_h = \text{Norm}(\mathbf{O}_h)$

$\tilde{\mathbf{O}}_{:, h:} = \boldsymbol{\gamma}_h \cdot \hat{\mathbf{O}}_{:, h:}$

The modulated tensor is then reshaped and projected to produce the final attention output.

Transforming a pre-trained Transformer into a hybrid architecture requires a principled initialization to avoid optimization instability. The authors adopt a three-stage transfer pipeline.

$\mathcal{L}_{align} = \sum_{l=1}^L || \mathbf{H}_{FA}^{(l)}(x) - \mathbf{H}_{Hybrid}^{(l)}(x) ||_2^2$

Experiment

PDF source

Table des matières

Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA

GPU prêts à l’emploi

Tarifs les plus avantageux

Commencer Voir les tarifs

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour

Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin

Propulsé par MailChimp

Command Palette

HydraHead : De l'hétérogénéité fonctionnelle de niveau tête à l'hybridation spécialisée de l'attention

Zhentao Tan Wei Chen Jingyi Shen Yao Liu Xu Shen Yue Wu Jieping Ye

Résumé

One-sentence Summary

Key Contributions

Introduction

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Command Palette

HydraHead : De l'hétérogénéité fonctionnelle de niveau tête à l'hybridation spécialisée de l'attention

Zhentao Tan Wei Chen Jingyi Shen Yao Liu Xu Shen Yue Wu Jieping Ye

Résumé

One-sentence Summary

Key Contributions

Introduction

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Command Palette

HydraHead : De l'hétérogénéité fonctionnelle de niveau tête à l'hybridation spécialisée de l'attention

Zhentao Tan Wei Chen Jingyi Shen Yao Liu Xu Shen Yue Wu Jieping Ye

Résumé

One-sentence Summary

Key Contributions

Introduction

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters