Gated Recurrent Memory for Long-Context Reasoning: When to Remember and When to Stop
Leheng Sheng, Yongtao Zhang, Wenchang Ma, Yaorui Shi, Ting Huang, Xiang Wang, An Zhang, Ke Shen, Tat-Seng Chua
Abstract
Reasoning over long contexts is essential for a wide range of real-world applications, yet it remains a major challenge for large language models (LLMs), whose performance degrades as context length grows. A recent work, MemAgent, tackled this problem by processing the context chunk by chunk in an RNN-like loop, updating a textual memory toward the final answer. However, this naive recurrent memory-update scheme has two fundamental flaws: (i) the memory is updated indiscriminately even on chunks that contain no evidence, so it can bloat rapidly; and (ii) the loop has no termination mechanism, so unnecessary computation continues even after sufficient evidence has been collected. To resolve these issues, we propose GRU-Mem for more stable and efficient long-context reasoning. Concretely, in GRU-Mem the memory is updated only when the update gate is open, and the recurrent loop terminates immediately once the exit gate opens. To realize this behavior, we introduce two reward signals within an end-to-end reinforcement learning framework: r^update, which rewards correct update actions, and r^exit, which rewards correct exit actions. Experimental results on a variety of long-context reasoning tasks demonstrate the effectiveness and efficiency of GRU-Mem: it generally outperforms the prior MemAgent while accelerating inference by up to 400%.
One-sentence Summary
Researchers from ByteDance Seed, NUS, and USTC propose GRU-Mem, a gated memory agent that stabilizes long-context reasoning in LLMs via controlled updates and early exits, outperforming MemAgent with up to 400% faster inference while reducing memory bloat and redundant computation.
Key Contributions
- GRU-Mem addresses key limitations of prior recurrent memory methods like MemAgent by introducing two text-controlled gates—an update gate that prevents memory explosion by updating only on relevant chunks, and an exit gate that halts computation once sufficient evidence is gathered.
- The model is trained end-to-end via reinforcement learning with two distinct reward signals, r^update and r^exit, which explicitly guide the agent to learn when to update memory and when to terminate the loop (a minimal sketch of these rewards follows this list).
- Evaluated on long-context QA tasks, GRU-Mem outperforms vanilla MemAgent while achieving up to 400% faster inference speed, demonstrating both improved accuracy and computational efficiency.
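As a concrete illustration of the two gate rewards, the sketch below assigns per-turn credit from chunk-level evidence labels. This is a minimal sketch assuming the training data marks which chunks contain evidence and which chunk is the last evidence chunk; the paper's exact reward formulas are not reproduced here, and all names are illustrative.

```python
def gate_rewards(update_pred, exit_pred, chunk_has_evidence, is_last_evidence_chunk):
    """Illustrative per-turn gate rewards; not the authors' exact formulas.

    Assumes each training chunk is labeled with whether it contains evidence
    (used for r^update) and whether it is the final evidence chunk (used for r^exit).
    """
    # r^update: reward updating memory on evidence chunks and keeping it otherwise.
    r_update = 1.0 if update_pred == chunk_has_evidence else 0.0
    # r^exit: reward exiting exactly once the last evidence chunk has been read.
    r_exit = 1.0 if exit_pred == is_last_evidence_chunk else 0.0
    return r_update, r_exit
```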
Introduction
The authors leverage recurrent memory architectures to tackle long-context reasoning, where LLMs must locate sparse evidence across millions of tokens—a challenge known as the “needle in a haystack” problem. Prior work like MemAgent processes context chunk-by-chunk but suffers from uncontrolled memory growth and no early exit, wasting computation even after sufficient evidence is found. Their main contribution, GRU-Mem, introduces two text-controlled gates—an update gate to selectively refresh memory and an exit gate to terminate processing early—trained via end-to-end reinforcement learning with distinct reward signals for each behavior. This yields both higher accuracy and up to 400% faster inference compared to vanilla MemAgent.
Method
The authors leverage a gated recurrent memory framework, GRU-Mem, to address the instability and inefficiency inherent in vanilla recurrent long-context reasoning. The core innovation lies in augmenting the memory agent with two text-controlled binary gates — an update gate (UG) and an exit gate (EG) — which dynamically regulate memory evolution and workflow termination. This design draws inspiration from gating mechanisms in GRUs, aiming to mitigate memory explosion and enable early exit when sufficient evidence is gathered.
The workflow begins by splitting the long context $C$ into $T$ fixed-size chunks $\{C_1, \dots, C_T\}$. At each step $t$, the memory agent $\phi_\theta$, conditioned on the question $Q$, the current chunk $C_t$, and the previous memory $M_{t-1}$, generates three outputs: a candidate memory $\hat{M}_t$, an update gate status $U_t$, and an exit gate status $E_t$. This is formalized as:

$$U_t, \hat{M}_t, E_t = \phi_\theta(Q, C_t, M_{t-1}).$$

The update gate determines whether to overwrite the memory: if $U_t$ is True, $M_t \leftarrow \hat{M}_t$; otherwise, $M_t \leftarrow M_{t-1}$. The exit gate controls workflow continuation: if $E_t$ is True, the loop terminates and the answer agent $\psi_\theta$ immediately generates the final answer $\hat{A}$ from $M_t$ and $Q$. This selective update and conditional termination are critical for maintaining memory stability and reducing unnecessary computation.
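A minimal Python sketch of this gated recurrence may help. The `memory_agent` and `answer_agent` callables stand in for the LLM-backed agents $\phi_\theta$ and $\psi_\theta$; their names, signatures, and the `use_exit_gate` flag (reused in the inference discussion below) are illustrative assumptions, not the authors' API.

```python
def run_gru_mem(question, context, chunk_size, memory_agent, answer_agent,
                use_exit_gate=True):
    # Split the long context into T fixed-size chunks C_1..C_T.
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
    memory = ""  # M_0: empty initial memory

    for chunk in chunks:
        # One recurrent step: (U_t, M_hat_t, E_t) = phi_theta(Q, C_t, M_{t-1}).
        update, candidate_memory, exit_now = memory_agent(question, chunk, memory)

        if update:                        # update gate open: overwrite memory
            memory = candidate_memory
        # update gate closed: memory carries over unchanged, M_t <- M_{t-1}

        if use_exit_gate and exit_now:    # exit gate open: stop reading chunks
            break

    # The answer agent psi_theta produces the final answer from (M_t, Q).
    return answer_agent(question, memory)
```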
Refer to the framework diagram, which illustrates how the update gate discards irrelevant candidate memories and how the exit gate halts processing once the last evidence is encountered, preventing memory explosion and avoiding redundant chunk processing.

To enforce this structured behavior, the memory agent is constrained to output in a predefined format. As shown in the prompt specification, the agent must first reason within designated thinking tags, then emit "yes" or "no" to activate or deactivate the update gate, followed by the candidate memory within memory tags, and finally "continue" or "end" to control the exit gate. This structured output ensures parseability and enforces the gating logic during inference.
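A parser for this structured output might look like the sketch below. The actual tag names are not reproduced in this summary, so <think>, <update>, <memory>, and <exit> are hypothetical placeholders; a parser for the real format would swap in the prompt's actual tags.

```python
import re

# Hypothetical tag names; the paper's prompt defines its own tag vocabulary.
RESPONSE_PATTERN = re.compile(
    r"<think>(?P<think>.*?)</think>\s*"
    r"<update>(?P<update>yes|no)</update>\s*"
    r"<memory>(?P<memory>.*?)</memory>\s*"
    r"<exit>(?P<exit>continue|end)</exit>",
    re.DOTALL,
)

def parse_agent_output(text):
    """Return (U_t, M_hat_t, E_t), or None if the output violates the format
    (which, during training, would forfeit the format reward)."""
    match = RESPONSE_PATTERN.search(text)
    if match is None:
        return None
    return (
        match.group("update") == "yes",  # U_t: update gate status
        match.group("memory").strip(),   # M_hat_t: candidate memory
        match.group("exit") == "end",    # E_t: exit gate status
    )
```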

Training is conducted end-to-end via reinforcement learning, extending the Multi-Conv DAPO algorithm. The policy model is optimized using a composite loss that incorporates multiple reward signals: outcome reward for answer correctness, update reward for accurate gate activation per turn, exit reward for terminating at the correct evidence chunk, and format reward for adhering to the structured output. The advantage calculation is disentangled into trajectory-level and turn-level components, which are combined with a hyperparameter α to balance global and local optimization signals. This dual advantage structure stabilizes training by separately evaluating the impact of gate decisions across turns and trajectories.
As shown in the advantage calculation diagram, the policy model receives feedback from both the trajectory-level advantage (comparing across entire workflows) and the turn-level advantage (comparing gate decisions at each step), enabling fine-grained control over the gating behavior.
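The sketch below shows one plausible way the two advantage levels could be blended with α. The group normalization follows the common GRPO/DAPO convention; the paper's exact normalization and reward weighting may differ, so treat this as an assumption-laden illustration rather than the authors' implementation.

```python
import numpy as np

def combined_advantage(traj_rewards, turn_rewards, alpha=0.9):
    """Blend trajectory-level and turn-level advantages.

    traj_rewards: shape (G,), one scalar per sampled workflow in a group
        (e.g. outcome + format rewards).
    turn_rewards: shape (G, T), per-turn gate rewards (r^update, r^exit).
    """
    # Trajectory-level advantage: compare entire workflows within the group.
    traj_adv = (traj_rewards - traj_rewards.mean()) / (traj_rewards.std() + 1e-6)
    # Turn-level advantage: compare gate decisions at each step across the group.
    turn_adv = (turn_rewards - turn_rewards.mean(axis=0)) / (turn_rewards.std(axis=0) + 1e-6)
    # alpha trades off the global (trajectory) and local (turn) signals.
    return alpha * traj_adv[:, None] + (1.0 - alpha) * turn_adv
```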

During inference, the authors provide flexibility by supporting two modes: with exit gate (w EG) and without exit gate (w/o EG). The w EG mode terminates early when $E_t$ is True, while the w/o EG mode processes all chunks regardless of gate status, accommodating tasks that require full context traversal. This dual-mode inference ensures adaptability across diverse long-context reasoning scenarios.
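Reusing the earlier run_gru_mem sketch, the two modes differ only in whether the exit signal is honored (all names remain illustrative):

```python
# "w EG": may stop early once the exit gate fires; fastest when evidence is front-loaded.
answer_fast = run_gru_mem(question, context, 4096, memory_agent, answer_agent,
                          use_exit_gate=True)

# "w/o EG": traverses every chunk; suits tasks that need full-context coverage.
answer_full = run_gru_mem(question, context, 4096, memory_agent, answer_agent,
                          use_exit_gate=False)
```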
Experiment
- GRU-Mem outperforms vanilla MemAgent across diverse QA and NIAH tasks, especially on out-of-distribution and multi-key benchmarks, with greater stability under long contexts and smaller model sizes.
- GRU-Mem achieves substantial inference speedups—up to 400% faster with early exit—without sacrificing accuracy, thanks to efficient memory management and adaptive termination.
- The update gate curbs memory explosion by selectively updating only evidence-relevant chunks, while the exit gate enables early termination when evidence is found early, improving flexibility in unbalanced evidence scenarios.
- Ablation studies show that α=0.9 balances update accuracy and reward stability; RL training further boosts performance, especially on harder tasks, by refining gating behaviors during training.
- Training dynamics reveal rapid learning of correct formatting and exit behavior, with response length and exit deviation stabilizing over time as the model learns to update and stop precisely.
The authors use GRU-Mem to enhance memory agent performance and efficiency across varying context lengths, showing consistent gains over MemAgent in both accuracy and inference speed. GRU-Mem maintains strong performance on out-of-distribution and multi-key tasks, achieves up to 2x faster inference even without early exit, and up to 400% faster inference when the exit gate is enabled. The gating mechanisms stabilize memory growth through selective updates and enable timely termination, and the efficiency gains become more pronounced as context size increases, indicating better scalability, particularly under long-context and evidence-unbalanced scenarios.