Error-Free Linear Attention is a Free Lunch: An Exact Solution from Continuous-Time Dynamics

Jingdi Lei, Di Zhang, Soujanya Poria

Abstract

Linear-time attention and state-space models (SSMs) promise to overcome the quadratic cost barrier of softmax attention in long-context language models. We introduce Error-Free Linear Attention (EFLA), a numerically stable, fully parallel, and generalized formulation of the delta rule. Concretely, we formulate the online learning update as a continuous-time dynamical system and prove that its exact solution is not only attainable but also computable in linear time with full parallelism. By exploiting the rank-1 structure of the dynamics matrix, we directly derive the exact closed-form solution, which effectively corresponds to a Runge-Kutta method of infinite order. This attention mechanism is theoretically error-free, capturing the continuous dynamics perfectly while preserving linear-time complexity. In a comprehensive set of experiments, we show that EFLA delivers robust performance in noisy settings, achieving lower language-modeling perplexity and superior results on downstream benchmarks without introducing additional parameters. Our work lays a new theoretical foundation for developing highly accurate, scalable linear-time attention models.

One-sentence Summary

Jingdi Lei, Di Zhang, and Soujanya Poria from Nanyang Technological University and Fudan University propose Error-Free Linear Attention (EFLA), a theoretically exact, numerically stable linear-time attention mechanism derived from solving the continuous-time ODE governing linear attention via an infinite-order Runge–Kutta method. By exploiting the rank-1 structure of the dynamics matrix, EFLA achieves closed-form, error-free updates with full parallelism and linear complexity, outperforming DeltaNet in robustness to noise and downstream tasks while eliminating discretization errors inherent in prior Euler-based approximations.

Key Contributions

  • Existing linear attention methods suffer from inherent numerical instability and truncation errors due to their reliance on first-order Euler discretization of continuous-time dynamics, which limits their accuracy and robustness in long-context scenarios despite their computational efficiency.

  • The paper reformulates linear attention as a continuous-time dynamical system governed by a first-order ODE, revealing that the standard approach corresponds to a low-order numerical integration scheme that fails to capture the true evolution of the state (the Euler step in question is written out after this list).

  • By exploiting the rank-1 structure of the dynamics matrix, the authors derive an exact closed-form solution equivalent to the infinite-order Runge–Kutta limit, achieving error-free integration with linear time complexity, full parallelism, and consistent performance gains over DeltaNet and other baselines.
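As a compact restatement of the last two points (using the notation introduced in the Method section below), a single explicit Euler step of size $\beta_t$ applied to the underlying ODE $\frac{d\mathbf{S}(t)}{dt} = -\mathbf{A}_t\mathbf{S}(t) + \mathbf{b}_t$ recovers the familiar delta-rule update; this is the low-order scheme that EFLA replaces with the exact solution:

$$\mathbf{S}_t = \mathbf{S}_{t-1} + \beta_t\left(-\mathbf{A}_t\mathbf{S}_{t-1} + \mathbf{b}_t\right) = \left(\mathbf{I} - \beta_t\,\mathbf{k}_t\mathbf{k}_t^{\top}\right)\mathbf{S}_{t-1} + \beta_t\,\mathbf{k}_t\mathbf{v}_t^{\top}.$$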

Introduction

The authors are motivated by the growing role of large language models as autonomous agents in complex, long-context tasks such as reasoning and tool use, where standard attention mechanisms become computationally prohibitive due to their quadratic time complexity. Prior linear attention methods, while efficient, rely on low-order numerical approximations like Euler integration to solve the underlying continuous-time dynamics, introducing truncation errors and instability, especially under long sequences or high decay rates. These approximations are inherently limited, with heuristic fixes like gating or adaptive coefficients only mitigating symptoms rather than eliminating the root cause. The authors' main contribution is EFLA, a principled reformulation of linear attention as a continuous-time dynamical system governed by a first-order ODE. By exploiting the rank-1 structure of the system, they derive an exact closed-form solution equivalent to the infinite-order Runge-Kutta limit, achieving error-free integration while preserving linear time complexity. This approach not only ensures numerical stability and robustness in noisy settings but also outperforms existing methods like DeltaNet across benchmarks, offering a theoretically sound and practically scalable foundation for high-fidelity attention.


Method

The authors leverage a continuous-time dynamical systems perspective to derive an exact, error-free solution for linear attention, addressing the numerical instability and error accumulation inherent in low-order discretization schemes. The core insight is to model the online learning update of the associative memory state $\mathbf{S}_t$ as a first-order ordinary differential equation (ODE). This ODE is defined as $\frac{d\mathbf{S}(t)}{dt} = -\mathbf{A}_t\mathbf{S}(t) + \mathbf{b}_t$, where the dynamics matrix $\mathbf{A}_t = \mathbf{k}_t\mathbf{k}_t^\top$ and the forcing term $\mathbf{b}_t = \mathbf{k}_t\mathbf{v}_t^\top$ are derived from the key and value vectors at time $t$. This formulation generalizes the delta-rule update, which corresponds to a first-order explicit Euler discretization of this ODE. By recognizing that the dynamics matrix $\mathbf{A}_t$ is rank-1, the authors exploit its algebraic properties to compute the exact analytical solution of the ODE. This solution, which corresponds to the infinite-order limit of the Runge-Kutta family of methods, is given by $\mathbf{S}_t = e^{-\beta_t \mathbf{A}_t} \mathbf{S}_{t-1} + \int_0^{\beta_t} e^{-(\beta_t - \tau)\mathbf{A}_t} \mathbf{b}_t \, d\tau$. The rank-1 structure allows the matrix exponential $e^{-\beta_t \mathbf{A}_t}$ to be computed in closed form as $\mathbf{I} - \frac{1 - e^{-\beta_t \lambda_t}}{\lambda_t} \mathbf{A}_t$, where $\lambda_t = \mathbf{k}_t^\top \mathbf{k}_t$. Similarly, the integral term simplifies to $\frac{1 - e^{-\beta_t \lambda_t}}{\lambda_t} \mathbf{b}_t$. Substituting these closed-form expressions yields the final update rule for the Error-Free Linear Attention (EFLA) mechanism. This update maintains linear time complexity with respect to sequence length, enabling efficient computation while capturing the exact continuous dynamics.
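The rank-1 identity behind these closed forms is short enough to spell out; the following derivation is a sketch consistent with the definitions above and introduces no quantities beyond $\lambda_t = \mathbf{k}_t^\top\mathbf{k}_t$. Since $\mathbf{A}_t = \mathbf{k}_t\mathbf{k}_t^\top$ satisfies $\mathbf{A}_t^2 = \lambda_t\mathbf{A}_t$, every power collapses onto $\mathbf{A}_t$ and the exponential series sums in closed form:

$$e^{-\beta_t\mathbf{A}_t} = \mathbf{I} + \sum_{n\ge 1}\frac{(-\beta_t)^n\lambda_t^{\,n-1}}{n!}\,\mathbf{A}_t = \mathbf{I} - \frac{1-e^{-\beta_t\lambda_t}}{\lambda_t}\,\mathbf{A}_t.$$

Likewise $\mathbf{A}_t\mathbf{b}_t = \lambda_t\mathbf{b}_t$, so $e^{-s\mathbf{A}_t}\mathbf{b}_t = e^{-s\lambda_t}\mathbf{b}_t$ and the integral evaluates to $\frac{1-e^{-\beta_t\lambda_t}}{\lambda_t}\mathbf{b}_t$, giving the EFLA update in fully explicit form:

$$\mathbf{S}_t = \left(\mathbf{I} - \frac{1-e^{-\beta_t\lambda_t}}{\lambda_t}\,\mathbf{k}_t\mathbf{k}_t^{\top}\right)\mathbf{S}_{t-1} + \frac{1-e^{-\beta_t\lambda_t}}{\lambda_t}\,\mathbf{k}_t\mathbf{v}_t^{\top}.$$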

[[IMG:|The framework diagram illustrates the continuous-time dynamical system formulation of linear attention, showing the state $\mathbf{S}(t)$ evolving according to the ODE $\frac{d\mathbf{S}(t)}{dt} = -\mathbf{A}_t\mathbf{S}(t) + \mathbf{b}_t$. The figure highlights the transition from the discrete delta-rule update to the continuous-time model, emphasizing the role of the dynamics matrix $\mathbf{A}_t$ and the forcing term $\mathbf{b}_t$. The exact solution of this ODE, derived using the rank-1 structure of $\mathbf{A}_t$, is the foundation of the EFLA mechanism.]]
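For readers who prefer code, the following minimal NumPy sketch contrasts one delta-rule (Euler) step with the EFLA closed-form step for a single token. The function names, the small-$\lambda_t$ guard, and the toy dimensions are illustrative assumptions; the authors' implementation is fully parallel over the sequence rather than a per-token loop.

```python
import numpy as np

def euler_step(S, k, v, beta):
    """Delta-rule update: one explicit Euler step of dS/dt = -(k k^T) S + k v^T."""
    # S: (d_k, d_v) associative memory, k: (d_k,), v: (d_v,), beta: scalar step size.
    return S + beta * (np.outer(k, v) - np.outer(k, k) @ S)

def efla_step(S, k, v, beta, eps=1e-12):
    """EFLA update: exact solution of the same ODE over a step of size beta,
    using the rank-1 structure A = k k^T with lambda = k^T k."""
    lam = float(k @ k)
    # coeff = (1 - exp(-beta*lambda)) / lambda, with a small-lambda guard
    # (the guard is an illustrative numerical choice, not from the paper).
    coeff = beta if lam < eps else (1.0 - np.exp(-beta * lam)) / lam
    return S - coeff * np.outer(k, k) @ S + coeff * np.outer(k, v)

# Toy comparison: for small beta*lambda the two updates agree to first order;
# for large beta*lambda the Euler step can overshoot while EFLA saturates.
rng = np.random.default_rng(0)
d_k, d_v = 8, 4
S = np.zeros((d_k, d_v))
k = rng.standard_normal(d_k)
v = rng.standard_normal(d_v)
for beta in (0.1, 1.0, 10.0):
    print(beta,
          np.linalg.norm(euler_step(S, k, v, beta)),
          np.linalg.norm(efla_step(S, k, v, beta)))
```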

Experiment

  • Numerical Stability and Robustness Verification: On sMNIST with pixel dropout, OOD intensity scaling, and additive Gaussian noise (a rough sketch of these corruptions follows this list), EFLA outperforms DeltaNet in convergence speed and robustness, maintaining high accuracy under severe interference. EFLA also performs significantly better at high input scales and with larger learning rates, validating that its exact saturation mechanism mitigates error accumulation and state explosion.
  • Language Modeling: On Wikitext and zero-shot reasoning tasks (LAMBADA, PiQA, HellaSwag, WinoGrande, ARC-e, ARC-c, BoolQ, OpenBookQA, SciQ), EFLA with 340M parameters achieves lower perplexity (37.01 vs. 38.09) and higher accuracy (23.9% vs. 22.5% on LAMBADA), with a +7.4% absolute improvement on BoolQ. At 1.3B parameters, EFLA maintains a performance lead even at 16B tokens, indicating superior long-sequence fidelity and scalability.
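As a rough illustration of the three corruption types named in the first bullet, the sketch below perturbs a flattened sMNIST input along each axis. The dropout rate, scale factor, and noise level are placeholder values, not the paper's settings.

```python
import numpy as np

def corrupt_smnist(x, rng, drop_p=0.2, scale=2.0, noise_std=0.3):
    """Illustrative corruptions for a flattened MNIST sequence x of shape (784,)
    with values in [0, 1]. Parameters are placeholders, not the paper's settings."""
    x_drop = x * (rng.random(x.shape) > drop_p)        # pixel dropout
    x_scaled = x * scale                                # OOD intensity scaling
    x_noisy = x + rng.normal(0.0, noise_std, x.shape)   # additive Gaussian noise
    return x_drop, x_scaled, x_noisy

rng = np.random.default_rng(0)
x = rng.random(784)  # stand-in for one flattened MNIST image
for name, x_c in zip(("dropout", "scaled", "noisy"), corrupt_smnist(x, rng)):
    print(name, float(x_c.min()), float(x_c.max()))
```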

The authors use 340M and 1.3B parameter models to compare EFLA against DeltaNet on language modeling and reasoning tasks, with results shown in Table 1. Results show that EFLA consistently outperforms DeltaNet across most metrics, achieving lower perplexity on Wikitext and LAMBADA and higher accuracy on multiple reasoning benchmarks, with the performance gap widening at larger model sizes.

