Error-Free Linear Attention Is a Free Lunch: Exact Solutions from Continuous-Time Dynamics
Jingdi Lei Di Zhang Soujanya Poria
Abstract
Linear-time attention and state space models (SSMs) promise to remove the quadratic-time bottleneck of softmax attention in long-context language models. In this work, we propose Error-Free Linear Attention (EFLA), a numerically stable, fully parallelizable formulation of the generalized delta rule. Specifically, we formulate the online-learning update as a continuous-time dynamical system and prove that its exact solution can be computed in linear time and in a fully parallel manner. By exploiting the rank-1 structure of the dynamics matrix, we directly derive the exact closed-form solution corresponding to an infinite-order Runge-Kutta method. The resulting attention mechanism is theoretically free of error accumulation and fully captures the continuous dynamics while retaining linear time complexity. Extensive experiments show that EFLA remains robust under noise and, without introducing additional parameters, achieves lower language modeling perplexity than DeltaNet and stronger performance on downstream benchmarks. This work provides a new theoretical foundation for building high-fidelity, scalable linear-time attention models.
One-sentence Summary
Jingdi Lei, Di Zhang, and Soujanya Poria from Nanyang Technological University and Fudan University propose Error-Free Linear Attention (EFLA), a theoretically exact, numerically stable linear-time attention mechanism derived from solving the continuous-time ODE governing linear attention via an infinite-order Runge–Kutta method. By exploiting the rank-1 structure of the dynamics matrix, EFLA achieves closed-form, error-free updates with full parallelism and linear complexity, outperforming DeltaNet in robustness to noise and downstream tasks while eliminating discretization errors inherent in prior Euler-based approximations.
Key Contributions
- Existing linear attention methods suffer from inherent numerical instability and truncation errors due to their reliance on first-order Euler discretization of continuous-time dynamics, which limits their accuracy and robustness in long-context scenarios despite their computational efficiency.
- The paper reformulates linear attention as a continuous-time dynamical system governed by a first-order ODE, revealing that the standard approach corresponds to a low-order numerical integration scheme that fails to capture the true evolution of the state.
- By exploiting the rank-1 structure of the dynamics matrix, the authors derive an exact closed-form solution equivalent to the infinite-order Runge–Kutta limit, achieving error-free integration with linear time complexity, full parallelism, and consistent performance gains over DeltaNet and other baselines.
Introduction
The authors motivate their work with the growing role of large language models as autonomous agents in complex, long-context tasks such as reasoning and tool use, where standard attention mechanisms become computationally prohibitive due to their quadratic time complexity. Prior linear attention methods, while efficient, rely on low-order numerical approximations like Euler integration to solve the underlying continuous-time dynamics, introducing truncation errors and instability, especially under long sequences or high decay rates. These approximations are inherently limited, with heuristic fixes like gating or adaptive coefficients only mitigating symptoms rather than eliminating the root cause. The authors' main contribution is EFLA, a principled reformulation of linear attention as a continuous-time dynamical system governed by a first-order ODE. By exploiting the rank-1 structure of the system, they derive an exact closed-form solution equivalent to the infinite-order Runge–Kutta limit, achieving error-free integration while preserving linear time complexity. This approach not only ensures numerical stability and robustness in noisy settings but also outperforms existing methods like DeltaNet across benchmarks, offering a theoretically sound and practically scalable foundation for high-fidelity attention.

Method
The authors leverage a continuous-time dynamical-systems perspective to derive an exact, error-free solution for linear attention, addressing the numerical instability and error accumulation inherent in low-order discretization schemes. The core insight is to model the online learning update of the associative memory state $S_t$ as a first-order ordinary differential equation (ODE). This ODE is defined as $\frac{dS(t)}{dt} = -A_t S(t) + b_t$, where the dynamics matrix $A_t = k_t k_t^\top$ and the forcing term $b_t = k_t v_t^\top$ are derived from the key and value vectors at time $t$. This formulation generalizes the delta rule update, which corresponds to a first-order explicit Euler discretization of this ODE. By recognizing that the dynamics matrix $A_t$ is rank-1, the authors exploit its algebraic properties to compute the exact analytical solution of the ODE. This solution, which corresponds to the infinite-order limit of the Runge-Kutta family of methods, is given by $S_t = e^{-\beta_t A_t} S_{t-1} + \int_0^{\beta_t} e^{-(\beta_t - \tau) A_t} b_t \, d\tau$. The rank-1 structure allows the matrix exponential $e^{-\beta_t A_t}$ to be computed in closed form as $I - \frac{1 - e^{-\beta_t \lambda_t}}{\lambda_t} A_t$, where $\lambda_t = k_t^\top k_t$. Similarly, the integral term simplifies to $\frac{1 - e^{-\beta_t \lambda_t}}{\lambda_t} b_t$. Substituting these closed-form expressions yields the final update rule for the Error-Free Linear Attention (EFLA) mechanism. This update maintains linear time complexity with respect to sequence length, enabling efficient computation while capturing the exact continuous dynamics.
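To spell out the two key steps compactly, the block below restates the derivation using only the symbols defined above: the delta rule is the explicit Euler step of this ODE, while the rank-1 identity $A_t^n = \lambda_t^{n-1} A_t$ (for $n \ge 1$) collapses both the matrix exponential and the integral. This is a worked restatement under the paper's stated definitions, not additional material from the paper.

```latex
% Explicit Euler step of  dS/dt = -A_t S + b_t  with step size \beta_t (the delta rule):
\[
S_t \approx S_{t-1} + \beta_t \bigl(-A_t S_{t-1} + b_t\bigr)
      = \bigl(I - \beta_t k_t k_t^\top\bigr) S_{t-1} + \beta_t k_t v_t^\top .
\]
% Exact step: A_t^n = \lambda_t^{n-1} A_t (n >= 1) and A_t b_t = \lambda_t b_t give
\[
e^{-\beta_t A_t} = I - \frac{1 - e^{-\beta_t \lambda_t}}{\lambda_t} A_t ,
\qquad
\int_0^{\beta_t} e^{-(\beta_t - \tau) A_t} b_t \, d\tau
  = \frac{1 - e^{-\beta_t \lambda_t}}{\lambda_t} b_t ,
\]
% so the exact (EFLA) update reads
\[
S_t = \Bigl(I - \frac{1 - e^{-\beta_t \lambda_t}}{\lambda_t} A_t\Bigr) S_{t-1}
      + \frac{1 - e^{-\beta_t \lambda_t}}{\lambda_t} b_t .
\]
```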
[[IMG:|The framework diagram illustrates the continuous-time dynamical system formulation of linear attention, showing the state $S(t)$ evolving according to the ODE $\frac{dS(t)}{dt} = -A_t S(t) + b_t$. The figure highlights the transition from the discrete delta rule update to the continuous-time model, emphasizing the role of the dynamics matrix $A_t$ and the forcing term $b_t$. The exact solution of this ODE, derived using the rank-1 structure of $A_t$, is the foundation of the EFLA mechanism.]]
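For concreteness, here is a minimal NumPy sketch of the sequential (recurrent) form of this update. It is not the authors' implementation: the function name `efla_recurrent`, the query read-out `q @ S`, and the small `eps` guard against division by zero are illustrative assumptions; only the per-token closed-form update mirrors the equations above.

```python
import numpy as np

def efla_recurrent(K, V, Q, beta, eps=1e-12):
    """Sequential sketch of the EFLA closed-form state update (illustrative).

    K, Q : (T, d_k) keys and queries;  V : (T, d_v) values;
    beta : (T,) positive per-token step sizes (integration horizons).
    Returns outputs O of shape (T, d_v).

    Per token, the exact rank-1 closed form is applied:
        S_t = (I - c_t * k_t k_t^T) S_{t-1} + c_t * k_t v_t^T,
        c_t = (1 - exp(-beta_t * lambda_t)) / lambda_t,   lambda_t = k_t^T k_t.
    """
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))           # associative memory state S_t
    O = np.zeros((T, d_v))
    for t in range(T):
        k, v, q = K[t], V[t], Q[t]
        lam = k @ k + eps              # lambda_t = k_t^T k_t (eps guards division)
        c = (1.0 - np.exp(-beta[t] * lam)) / lam
        # Exact solution of dS/dt = -k k^T S + k v^T over a step of length beta_t
        S = S - c * np.outer(k, k @ S) + c * np.outer(k, v)
        O[t] = q @ S                   # assumed query read-out convention
    return O
```

Note that as $\beta_t \lambda_t \to 0$ the coefficient $c_t \to \beta_t$, recovering the Euler (delta rule) step. The loop above is only meant to make the per-token arithmetic concrete; the paper's appeal is that the same closed form also admits a fully parallel, linear-time computation across the sequence.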
Experiment
- Numerical Stability and Robustness Verification: On sMNIST with pixel dropout, OOD intensity scaling, and additive Gaussian noise, EFLA outperforms DeltaNet in convergence speed and robustness, maintaining high accuracy under severe interference. EFLA achieves significantly better performance at high input scales and with larger learning rates, validating that its exact saturation mechanism mitigates error accumulation and state explosion.
- Language Modeling: On Wikitext and zero-shot reasoning tasks (LAMBADA, PiQA, HellaSwag, WinoGrande, ARC-e, ARC-c, BoolQ, OpenBookQA, SciQ), EFLA with 340M parameters achieves lower perplexity (37.01 vs. 38.09) and higher accuracy (23.9% vs. 22.5% on LAMBADA), with a +7.4% absolute improvement on BoolQ. At 1.3B parameters, EFLA maintains a performance lead even at 16B tokens, indicating superior long-sequence fidelity and scalability.
The authors use 340M- and 1.3B-parameter models to compare EFLA against DeltaNet on language modeling and reasoning tasks, with results shown in Table 1. Results show that EFLA consistently outperforms DeltaNet across most metrics, achieving lower perplexity on Wikitext and LAMBADA and higher accuracy on multiple reasoning benchmarks, with the performance gap widening at larger model sizes.
