
Error-Free Linear Attention Is a Free Lunch: Exact Solution from Continuous-Time Dynamics

Jingdi Lei, Di Zhang, Soujanya Poria

Abstract

Linear-time attention and State Space Models (SSMs) promise to resolve the quadratic-cost bottleneck of long-context language models built on softmax attention. We introduce Error-Free Linear Attention (EFLA), a generalized, numerically stable, and fully parallelizable formulation of the delta rule. Specifically, we formulate the online-learning update as a continuous-time dynamical system and prove that its exact solution is not only attainable but also computable in linear time with full parallelization. Exploiting the rank-1 structure of the dynamics matrix, we directly derive an exact analytical solution, which effectively corresponds to an infinite-order Runge-Kutta method. This attention mechanism is theoretically free of error accumulation, capturing the continuous dynamics perfectly while preserving linear-time complexity. Through extensive experiments, we demonstrate that EFLA achieves robust performance in noisy settings, lower language-modeling perplexity, and stronger downstream benchmark performance than DeltaNet, without introducing additional parameters. This work establishes a new theoretical foundation for designing high-fidelity, scalable linear-time attention models.

One-sentence Summary

Nanyang Technological University and Fudan University researchers propose Error-Free Linear Attention (EFLA), which eliminates discretization errors in linear attention by deriving the exact closed-form solution of continuous-time dynamics through rank-1 matrix properties, achieving linear-time complexity without error accumulation and demonstrating superior robustness in noisy environments, lower perplexity, and better benchmark performance than DeltaNet without additional parameters.

Key Contributions

  • Identifies that existing linear attention methods suffer from numerical instability due to low-order discretization of continuous-time dynamics, causing truncation errors especially in long-context scenarios where Euler-based approximations fail. This explains performance degradation in noisy environments and extended sequences.
  • Reformulates linear attention as a continuous-time dynamical system governed by a first-order ordinary differential equation, revealing that standard implementations correspond to suboptimal numerical integration schemes like Euler discretization. This theoretical perspective bridges attention mechanisms with continuous-time system modeling.
  • Derives an exact closed-form solution for the rank-1 dynamics matrix that eliminates discretization errors while maintaining linear-time complexity, validated through language modeling perplexity improvements and superior downstream benchmark performance over DeltaNet without additional parameters.

Introduction

Long-context modeling is critical for efficiently processing lengthy sequences in applications like language understanding, where standard attention mechanisms become computationally prohibitive at scale due to quadratic complexity. Prior approaches such as linear attention often face numerical instability from approximate discretization of continuous dynamics, introducing errors that degrade performance. The authors address this by proving that rank-1 linear attention admits an exact, error-free discretization when derived from its continuous-time formulation, providing a rigorous theoretical foundation to enhance the reliability of existing linear attention implementations without proposing new architectural primitives. This insight offers a pathway to more stable long-context models while complementing alternative linear-time frameworks like RetNet or Hyena.


Method

The authors leverage a continuous-time dynamical systems perspective to reformulate linear attention as an exact, error-free solution to a first-order ordinary differential equation (ODE). Rather than relying on low-order numerical approximations such as Euler or Runge-Kutta discretizations, they derive a closed-form analytical solution that captures the continuous evolution of the attention state without truncation error. This solution is made computationally tractable by exploiting the rank-1 structure of the underlying dynamics matrix, enabling linear-time complexity while preserving mathematical fidelity.

The core formulation begins by interpreting the DeltaNet update — which minimizes a reconstruction loss via gradient descent — as a discretization of the continuous-time ODE:

$$\frac{d\mathbf{S}(t)}{dt} = -\mathbf{A}_t \mathbf{S}(t) + \mathbf{b}_t,$$

where $\mathbf{A}_t = \mathbf{k}_t \mathbf{k}_t^\top$ and $\mathbf{b}_t = \mathbf{k}_t \mathbf{v}_t^\top$. Under the Zero-Order Hold assumption for discrete input sequences, this ODE governs the evolution of the state matrix $\mathbf{S}(t)$, which accumulates key-value associations over time. Standard linear attention methods correspond to first-order Euler integration of this system, introducing local truncation errors of $\mathcal{O}(\beta_t^2)$ and suffering from instability under stiff dynamics.
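
For concreteness, a single forward-Euler step of this ODE with step size $\beta_t$ reproduces exactly the DeltaNet-style delta-rule update:

$$\mathbf{S}_t = \mathbf{S}_{t-1} + \beta_t\left(-\mathbf{A}_t\mathbf{S}_{t-1} + \mathbf{b}_t\right) = \left(\mathbf{I} - \beta_t\,\mathbf{k}_t\mathbf{k}_t^\top\right)\mathbf{S}_{t-1} + \beta_t\,\mathbf{k}_t\mathbf{v}_t^\top,$$

which makes the correspondence between the discrete update and the continuous dynamics explicit.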

To eliminate these errors, the authors derive the exact solution to the ODE by taking the infinite-order limit of the Runge-Kutta family. This yields:

$$\mathbf{S}_t = e^{-\beta_t\mathbf{A}_t}\,\mathbf{S}_{t-1} + \int_0^{\beta_t} e^{-(\beta_t - \tau)\mathbf{A}_t}\,\mathbf{b}_t \, d\tau.$$

While matrix exponentials typically incur $\mathcal{O}(d^3)$ cost, the rank-1 property of $\mathbf{A}_t$ allows a closed-form simplification. Specifically, $\mathbf{A}_t^n = \lambda_t^{n-1} \mathbf{A}_t$ for $n \geq 1$, where $\lambda_t = \mathbf{k}_t^\top \mathbf{k}_t$. This collapses the matrix exponential into:

$$e^{-\beta_t\mathbf{A}_t} = \mathbf{I} - \frac{1 - e^{-\beta_t\lambda_t}}{\lambda_t}\,\mathbf{A}_t.$$
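
This identity follows from the power series of the matrix exponential combined with the rank-1 power identity $\mathbf{A}_t^n = \lambda_t^{n-1}\mathbf{A}_t$:

$$e^{-\beta_t\mathbf{A}_t} = \mathbf{I} + \sum_{n=1}^{\infty}\frac{(-\beta_t)^n}{n!}\mathbf{A}_t^n = \mathbf{I} + \frac{1}{\lambda_t}\sum_{n=1}^{\infty}\frac{(-\beta_t\lambda_t)^n}{n!}\,\mathbf{A}_t = \mathbf{I} + \frac{e^{-\beta_t\lambda_t}-1}{\lambda_t}\,\mathbf{A}_t.$$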

Similarly, the integral term simplifies via the identity $\mathbf{A}_t \mathbf{b}_t = \lambda_t \mathbf{b}_t$, yielding:

$$\int_0^{\beta_t} e^{-(\beta_t - \tau)\mathbf{A}_t}\,\mathbf{b}_t \, d\tau = \frac{1 - e^{-\beta_t\lambda_t}}{\lambda_t}\,\mathbf{b}_t.$$
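
As a quick numerical sanity check (not from the paper), both closed forms can be compared against a dense matrix exponential and brute-force quadrature; the sketch below uses NumPy/SciPy with arbitrary dimensions and a fixed random seed.

```python
import numpy as np
from scipy.linalg import expm
from scipy.integrate import trapezoid

rng = np.random.default_rng(0)
d, beta = 8, 0.5                       # head dimension and step size (arbitrary choices)
k = rng.standard_normal(d)             # key k_t
v = rng.standard_normal(d)             # value v_t
A = np.outer(k, k)                     # rank-1 dynamics matrix A_t = k_t k_t^T
b = np.outer(k, v)                     # input term b_t = k_t v_t^T
lam = k @ k                            # lambda_t = k_t^T k_t
coef = (1.0 - np.exp(-beta * lam)) / lam

# 1) closed-form matrix exponential vs. dense expm
assert np.allclose(np.eye(d) - coef * A, expm(-beta * A))

# 2) closed-form integral vs. trapezoidal quadrature over tau in [0, beta]
taus = np.linspace(0.0, beta, 5001)
vals = np.stack([expm(-(beta - t) * A) @ b for t in taus])
assert np.allclose(trapezoid(vals, taus, axis=0), coef * b, atol=1e-5)
print("rank-1 closed forms match the dense computations")
```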

Combining these results, the final Error-Free Linear Attention (EFLA) update rule becomes:

$$\mathbf{S}_t = \left( \mathbf{I} - \frac{1 - e^{-\beta_t\lambda_t}}{\lambda_t}\,\mathbf{k}_t \mathbf{k}_t^\top \right) \mathbf{S}_{t-1} + \frac{1 - e^{-\beta_t\lambda_t}}{\lambda_t}\,\mathbf{k}_t \mathbf{v}_t^\top.$$

This update retains the same algebraic structure as DeltaNet, enabling seamless adoption of existing hardware-efficient parallelization techniques. The authors further derive a chunkwise parallel formulation by unrolling the recurrence and expressing the state as a product of decay operators and accumulated inputs. This allows efficient batched computation over sequence chunks, maintaining $\mathcal{O}(Ld^2)$ complexity while enabling full parallelism.
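
As an illustration only, the recurrence below is a minimal token-by-token NumPy sketch of the EFLA state update with the standard linear-attention readout $\mathbf{o}_t = \mathbf{S}_t^\top \mathbf{q}_t$; it omits the authors' chunkwise parallel formulation, feature maps, and normalization, and all names are assumptions rather than the reference implementation.

```python
import numpy as np

def efla_recurrence(Q, K, V, beta):
    """Naive O(L d^2) scan over the EFLA update (illustrative sketch).

    Q, K: (L, d_k) arrays of queries and keys; V: (L, d_v) array of values.
    beta: (L,) array of positive step sizes beta_t.
    Returns per-token outputs o_t = S_t^T q_t with shape (L, d_v).
    """
    L, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))              # state accumulating key-value associations
    out = np.empty((L, d_v))
    for t in range(L):
        k, v, q = K[t], V[t], Q[t]
        lam = k @ k                       # lambda_t = k_t^T k_t
        x = beta[t] * lam
        # effective step gamma_t = (1 - exp(-beta_t lambda_t)) / lambda_t,
        # falling back to the delta-rule limit gamma_t -> beta_t as lambda_t -> 0
        gamma = beta[t] * (-np.expm1(-x)) / x if x > 1e-12 else beta[t]
        # S_t = (I - gamma_t k_t k_t^T) S_{t-1} + gamma_t k_t v_t^T
        S = S - gamma * np.outer(k, k @ S) + gamma * np.outer(k, v)
        out[t] = S.T @ q
    return out
```

Writing $\gamma_t = \beta_t\,\frac{1 - e^{-\beta_t\lambda_t}}{\beta_t\lambda_t}$ keeps the coefficient numerically stable for very small key norms, where it smoothly approaches the DeltaNet step $\beta_t$.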

The spectral properties of $\mathbf{A}_t$ also reveal an implicit gating mechanism: the squared key norm $\lambda_t$ controls the decay rate along the direction of $\mathbf{k}_t$. A large $\lambda_t$ induces rapid forgetting, while a small $\lambda_t$ results in near-linear decay, effectively prioritizing retention of historical context. In the limit $\lambda_t \to 0$, EFLA recovers the delta rule, confirming that prior linear attention methods are first-order approximations valid only under non-stiff dynamics.
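
The delta-rule limit can be read off from the effective step size: expanding for small $\beta_t\lambda_t$,

$$\frac{1 - e^{-\beta_t\lambda_t}}{\lambda_t} = \beta_t - \frac{\beta_t^2\lambda_t}{2} + \mathcal{O}\!\left(\beta_t^3\lambda_t^2\right) \;\longrightarrow\; \beta_t \quad (\lambda_t \to 0),$$

so the EFLA update reduces to the delta rule as its first-order special case.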

By grounding the attention mechanism in continuous-time dynamics and deriving its exact solution, EFLA eliminates the numerical error inherent in discretized approximations, offering a theoretically grounded, scalable, and stable alternative to existing linear attention formulations.

Experiment

  • Numerical stability tests on sMNIST: EFLA maintains significantly higher accuracy than DeltaNet under pixel dropout, OOD intensity scaling, and additive Gaussian noise, especially with a learning rate of 3e-3, validating its robustness against error accumulation and state explosion.
  • Language modeling on Wikitext and LAMBADA: EFLA achieves 81.28 perplexity (vs. DeltaNet's 96.26) and 23.9% accuracy on LAMBADA, while surpassing DeltaNet by +7.4% absolute accuracy on BoolQ, confirming superior long-sequence information fidelity.
  • Learning rate analysis: EFLA requires a larger learning rate (3e-3) to counteract saturation effects, empirically validated by improved robustness under interference compared to conservative rates (1e-4).

The authors compare EFLA and DeltaNet on language modeling and reasoning tasks using 340M and 1.3B parameter models. Results show EFLA consistently outperforms DeltaNet across most benchmarks, achieving lower perplexity on Wikitext and LAMBADA and higher accuracy on tasks like BoolQ and SciQ, with the performance gap widening at larger scale. This improvement is attributed to EFLA’s exact decay mechanism, which preserves long-range context fidelity more effectively than DeltaNet’s Euler-based approximation.

