Memory Intelligence Agent
Jingyang Qiao, Weicheng Meng, Yu Cheng, Zhihang Lin, Zhizhong Zhang, Xin Tan, Jingyu Gong, Kun Shao, Yuan Xie
Abstract
Deep research agents (DRAs) integrate LLM reasoning with external tools. Memory systems enable DRAs to leverage historical experiences, which are essential for efficient reasoning and autonomous evolution. Existing methods rely on retrieving similar trajectories from memory to aid reasoning, but suffer from two key limitations: ineffective memory evolution and growing storage and retrieval costs. To address these problems, we propose a novel Memory Intelligence Agent (MIA) framework built on a Manager-Planner-Executor architecture. The Memory Manager is a non-parametric memory system that stores compressed historical search trajectories. The Planner is a parametric memory agent that produces search plans for questions. The Executor is an agent that searches for and analyzes information guided by the search plan. To build the MIA framework, we first adopt an alternating reinforcement learning paradigm to enhance cooperation between the Planner and the Executor. Furthermore, we enable the Planner to continuously evolve during test-time learning, with updates performed on the fly alongside inference without interrupting the reasoning process. Additionally, we establish a bidirectional conversion loop between parametric and non-parametric memories to achieve efficient memory evolution. Finally, we incorporate reflection and unsupervised judgment mechanisms to boost reasoning and self-evolution in the open world. Extensive experiments across eleven benchmarks demonstrate the superiority of MIA. First, MIA significantly enhances current SOTA LLMs' performance on deep research tasks; for instance, it boosts GPT-5.4 performance by up to 9% and 6% on LiveVQA and HotpotQA, respectively. Furthermore, with a lightweight Executor such as Qwen2.5-VL-7B, MIA achieves an average improvement of 31% across the evaluated datasets, outperforming the much larger Qwen2.5-VL-32B by a margin of 18%.
Additionally, training analysis reveals that reinforcement learning enables the Planner and Executor to synergistically optimize their strategies, effectively capturing dataset-specific characteristics and enhancing cross-domain reasoning and memory capabilities. Tool analysis reveals that long-context memory methods struggle with multi-turn tool interaction, whereas the proposed MIA significantly outperforms previous methods. Under unsupervised settings, MIA achieves performance comparable to its supervised counterpart while exhibiting progressive self-evolution across multiple training iterations.
One-sentence Summary
The Memory Intelligence Agent (MIA) framework employs a Manager-Planner-Executor architecture to unify parametric and non-parametric memories via bidirectional conversion and alternating reinforcement learning, enabling continuous test-time evolution that achieves up to 9% and 6% performance gains for GPT-5.4 on LiveVQA and HotpotQA respectively while allowing a lightweight Qwen2.5-VL-7B Executor to achieve a 31% average improvement across eleven benchmarks and outperform the Qwen2.5-VL-32B by 18%.
Key Contributions
- The paper introduces the Memory Intelligence Agent (MIA) framework, which employs a Manager-Planner-Executor architecture to store compressed historical search trajectories and generate search plans. This design facilitates a bidirectional conversion loop between parametric and non-parametric memories to achieve efficient memory evolution during inference.
- An alternating reinforcement learning paradigm is adopted to enhance cooperation between the Planner and the Executor, allowing the Planner to continuously evolve during test-time learning without interrupting reasoning. Training analysis indicates this approach enables synergistic optimization of strategies, effectively capturing dataset-specific characteristics and enhancing cross-domain reasoning capabilities.
- Extensive experiments across eleven benchmarks demonstrate the superiority of MIA, showing significant performance enhancements for current SOTA LLMs in deep research tasks. The method boosts GPT-5.4 performance by up to 9% on LiveVQA and enables a lightweight Executor to outperform larger models by an 18% margin while achieving comparable performance in unsupervised settings.
Introduction
Deep research agents combine large language models with external tools to handle complex tasks, relying heavily on memory systems to accumulate experience and refine strategies over time. However, existing methods often depend on long-context memory that stores raw search trajectories, leading to bloated storage, retrieval inefficiencies, and attention dilution. Furthermore, prior planners lack task-specific training, while memory systems focus on factual data rather than the process-oriented strategies needed for effective search planning. To address these issues, the authors introduce the Memory Intelligence Agent (MIA) framework, featuring a Manager-Planner-Executor architecture. They utilize a non-parametric system for compressed historical storage and a parametric agent for dynamic planning. Through alternating reinforcement learning and bidirectional memory conversion, MIA enables continuous test-time evolution. This approach significantly enhances reasoning performance and allows smaller models to outperform larger counterparts on deep research benchmarks.
Dataset
The authors evaluate their framework on both multimodal and text-only datasets to assess performance across different modalities.
Multimodal Data Composition and Usage
- Training: The authors adopt FVQA-train, which contains 4,856 image-question-answer examples from the MMSearch-R1 setting, to train both the Executor and the Planner.
- Public Evaluation Sets: They utilize FVQA-test (1,800 examples), InfoSeek (2,000 examples), SimpleVQA (1,013 examples), and MMSearch (171 examples) as held-out benchmarks. For LiveVQA, they use the accessible public version with 2,384 examples instead of the deprecated 3,602-example version.
- In-house Multimodal Sets: Two custom subsets were constructed to address specific domains. In-house 1 includes 295 examples in scientific fields like Physics and Chemistry, created by crawling text, synthesizing QA pairs, and retrieving matched images. In-house 2 consists of 505 examples from dynamic domains such as Sports and Entertainment, generated from news corpora where Qwen2.5-VL-72B-Instruct anchors visual entities to build complex multi-hop questions.
Text-Only Data Composition and Usage
- Training: The authors incorporate 6,175 examples from MATPO into the Planner training data to enhance planning abilities in textual environments.
- Evaluation: The text-only evaluation suite comprises 2Wiki (12,576 samples), HotpotQA (7,405 samples), SimpleQA (4,327 samples), and GAIA-Text (103 examples). These datasets test multi-hop reasoning, factual knowledge retrieval, and general problem-solving skills.
Method
The proposed MIA framework introduces a Manager-Planner-Executor architecture designed to address storage bottlenecks and reasoning inefficiencies in deep research agents. This architecture decouples historic memory, parametric planning, and dynamic execution. The core components include the Planner, which acts as a cognitive hub to generate search plans; the Executor, which serves as an operational terminal to implement plans via tool usage; and the Memory Manager, which handles memory compression and retrieval.
Refer to the framework diagram below for a visual representation of the overall architecture and reasoning process.
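To make the decoupling concrete, here is a minimal sketch of how the three roles could be separated in code. All class and method names (`MemoryManager.store/retrieve`, `Planner.plan`, `Executor.execute`) are illustrative assumptions, not the paper's API, and the keyword-overlap retrieval is a stand-in for the real scoring.

```python
# Hypothetical sketch of the Manager-Planner-Executor decoupling.
# Names and logic are assumptions for illustration only.
class MemoryManager:
    """Non-parametric store of compressed workflow summaries."""
    def __init__(self):
        self.workflows = []

    def store(self, summary):
        self.workflows.append(summary)

    def retrieve(self, question):
        # Naive keyword overlap stands in for the real retrieval scoring.
        tokens = question.lower().split()
        return [w for w in self.workflows if any(t in w for t in tokens)]

class Planner:
    """Parametric cognitive hub: turns a question plus memories into a plan."""
    def plan(self, question, memories):
        hints = f" using {len(memories)} past workflow(s)" if memories else ""
        return [f"decompose: {question}{hints}", "search", "synthesize answer"]

class Executor:
    """Operational terminal: carries out each sub-goal via tools."""
    def execute(self, plan):
        return [f"done: {step}" for step in plan]

manager, planner, executor = MemoryManager(), Planner(), Executor()
manager.store("search for capital cities workflow")
plan = planner.plan("What is the capital of France?",
                    manager.retrieve("capital of France"))
results = executor.execute(plan)
```

The point of the split is that each role can be swapped or updated independently, which is what lets the Planner evolve at test time while the Executor stays fixed.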
The reasoning process operates through a planning-execution-memory loop comprising three stages: memory retrieval, collaborative reasoning, and experience consolidation. Initially, the Memory Manager retrieves relevant historical trajectories based on semantic similarity, value reward, and frequency reward to provide contextual support. The Planner then decomposes complex tasks into executable sub-goals using a few-shot Chain-of-Thought strategy. The Executor interacts with the environment via a ReAct loop, performing task reasoning and tool usage. A dynamic feedback loop connects the two agents; if the Executor encounters an impasse, the Planner triggers a Reflect-Replan mechanism to adjust the search plan.
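The retrieval stage above scores candidate memories on three signals. A hedged sketch of how such a score might be combined is shown below; the linear combination and the weights `w_sim`, `w_val`, `w_freq` are assumptions, since the paper's exact formula is not given here.

```python
# Illustrative retrieval scoring: semantic similarity + value reward +
# frequency reward. The weighted sum and weights are assumptions.
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    summary: str
    embedding: list        # semantic embedding of the workflow summary
    value_reward: float    # how useful this workflow proved historically
    frequency: int         # how often it has been retrieved

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_emb, memory, k=3, w_sim=0.6, w_val=0.3, w_freq=0.1):
    """Return the top-k entries by combined semantic/value/frequency score."""
    max_freq = max((e.frequency for e in memory), default=1) or 1
    scored = [
        (w_sim * cosine(query_emb, e.embedding)
         + w_val * e.value_reward
         + w_freq * e.frequency / max_freq, e)
        for e in memory
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [e for _, e in scored[:k]]
```

With this shape, a workflow that is only moderately similar can still win retrieval if it has repeatedly earned high value rewards, which matches the stated intent of combining similarity with reward signals.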
To ensure efficient memory storage, the system compresses verbose trajectories into structured workflow summaries. This process involves converting visual inputs into textual captions and abstracting action patterns. The prompts used for this memory compression are detailed in the figure below.
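In the actual system this compression is prompt-driven, but the two operations it describes (captioning visual inputs, abstracting action patterns) can be sketched procedurally. The trajectory record format and the `caption_fn` hook below are assumptions for illustration.

```python
# Hedged sketch of trajectory compression: visual inputs become textual
# captions, and consecutive repeated actions are collapsed into a pattern.
# The (kind, payload) record format is an assumption, not the paper's format.
def compress_trajectory(trajectory, caption_fn):
    """trajectory: list of (kind, payload) steps, e.g. ('image', img) or
    ('action', 'search'). Returns a compact workflow summary dict."""
    actions, captions = [], []
    for kind, payload in trajectory:
        if kind == "image":
            captions.append(caption_fn(payload))       # visual input -> text
        elif kind == "action":
            if not actions or actions[-1] != payload:  # collapse repeats
                actions.append(payload)
    return {"action_pattern": " -> ".join(actions), "captions": captions}

summary = compress_trajectory(
    [("image", "img0"), ("action", "search"), ("action", "search"),
     ("action", "read"), ("action", "answer")],
    caption_fn=lambda img: f"caption({img})",
)
```

The resulting summary is far smaller than the raw trajectory, which is what keeps Memory Manager storage and retrieval costs from growing with trajectory length.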
To optimize the collaboration between the Planner and Executor, the authors propose a two-stage alternating Reinforcement Learning (RL) training strategy based on Group Relative Policy Optimization (GRPO). In the first stage, the Planner is frozen while the Executor is trained to understand and follow plans. In the second stage, the Executor is frozen, and the Planner is optimized using memory contexts to enhance plan generation and reflection capabilities. The training pipeline, including the forward rollout and backward reward calculation, is illustrated below.
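The alternating schedule can be sketched as follows. The GRPO-style group-relative advantage (reward normalized within a rollout group) is standard, but the stage names, the stubbed `rollout_fn`/`update_fn` hooks, and the group size are illustrative assumptions.

```python
# Sketch of the two-stage alternating schedule with GRPO-style
# group-relative advantages. The parameter update itself is stubbed out.
def grpo_advantages(rewards):
    """Normalize each rollout's reward within its group (zero mean, unit std)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid division by zero for identical rewards
    return [(r - mean) / std for r in rewards]

def alternating_train(stages, rollout_fn, update_fn, group_size=4):
    """Freeze one agent while optimizing the other, then swap."""
    for stage in stages:  # e.g. ["executor", "planner"]
        frozen = "planner" if stage == "executor" else "executor"
        rewards = [rollout_fn(stage) for _ in range(group_size)]
        advs = grpo_advantages(rewards)
        update_fn(trainable=stage, frozen=frozen, advantages=advs)
```

Training the Executor first (with the Planner frozen) teaches it to follow plans; only then is the Planner optimized against an Executor whose plan-following behavior is stable.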
During Test-Time Learning (TTL), MIA adopts an online learning paradigm that performs exploration, storage, and learning simultaneously. This mechanism allows the agent to update its parametric memory (Planner parameters) and non-parametric memory (Workflow Memory) without interrupting the reasoning workflow. In unsupervised environments where ground-truth answers are unavailable, the framework employs a novel evaluation mechanism that mimics the peer-review process. This involves specialized reviewers for reasoning, information sourcing, and result validity, overseen by an Area Chair agent to ensure robust self-evolution. This unsupervised evaluation framework is depicted below.
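The peer-review-style judgment can be sketched as three specialized reviewers whose votes an Area Chair aggregates. The reviewer criteria and the majority-vote acceptance rule below are assumptions; in the paper the reviewers and Area Chair are LLM agents, not string heuristics.

```python
# Hedged sketch of the unsupervised peer-review judgment: specialized
# reviewers each vote on an answer and an Area Chair aggregates the votes.
# The toy string-based criteria stand in for the real LLM reviewers.
def area_chair(answer, reviewers, threshold=2):
    """Accept the answer if at least `threshold` reviewers approve it."""
    votes = {name: review(answer) for name, review in reviewers.items()}
    return sum(votes.values()) >= threshold, votes

reviewers = {
    "reasoning": lambda a: "because" in a.lower(),  # has a rationale
    "sourcing":  lambda a: "http" in a.lower(),     # cites a source
    "validity":  lambda a: len(a.strip()) > 0,      # non-empty result
}
accepted, votes = area_chair(
    "Paris, because the retrieved page http://example.org says so.", reviewers
)
```

Only accepted trajectories would then feed the test-time updates, giving the agent a supervision signal without ground-truth answers.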
Experiment
MIA is evaluated on diverse multimodal and text-only benchmarks against closed-source models and memory-based agents. Results indicate that MIA's dual-memory mechanism and online updating mitigate noise found in traditional contextual methods, surpassing many closed-source models and agents. Ablation and generalization analyses further validate that guiding planning with memory rather than direct execution, combined with test-time learning, ensures robust reasoning across modalities and executor types.
The authors evaluate their proposed MIA framework against direct answer models, search agents, and memory-based agents across multiple multimodal datasets. The results indicate that MIA achieves superior performance compared to most baselines, internalizing historical experiences more effectively than contextual memory approaches. The proposed MIA model achieves top performance among open-source models and rivals the capabilities of leading closed-source systems like Gemini-3-Flash across various benchmarks. Contextual memory methods such as RAG and Mem0 generally underperform the No Memory baseline, suggesting that long memory contexts can introduce noise. The unsupervised variant of MIA demonstrates strong capabilities, achieving competitive results that approach the performance of the fully supervised version.
The authors conduct an ablation study to evaluate the contribution of individual components within their proposed framework on multimodal datasets. Results indicate that incrementally introducing modules such as memory-guided planning and test-time learning consistently enhances performance over the baseline. The complete system achieves the highest accuracy across all in-domain and out-of-domain benchmarks compared to intermediate variations. Integrating memory specifically to guide the planning process yields significant performance gains compared to using memory alone or a baseline without it. Training the planner component through alternating reinforcement learning provides substantial improvements over using a standard pre-trained planner. The full framework incorporating test-time learning outperforms all ablated versions across every evaluated dataset, demonstrating the effectiveness of the complete architecture.
The table presents an evaluation of the Unsupervised MIA model across multimodal and text-only benchmarks, comparing its performance against Base and Plan-and-Reflect baselines. It specifically tracks the model's progress over three epochs of unsupervised self-evolution to demonstrate the effectiveness of the proposed method. The results indicate that the model consistently outperforms the baselines and shows steady improvements in accuracy as it undergoes more epochs of training. Unsupervised MIA achieves higher accuracy than the Base and Plan-and-Reflect models across all evaluated datasets. Performance demonstrates a consistent upward trend from epoch-1 to epoch-3, validating the effectiveness of the self-evolution mechanism. The model exhibits significant improvements on text-only reasoning tasks compared to the baseline configurations.
The table presents an ablation study evaluating the impact of specific architectural components on text-only out-of-domain reasoning tasks. Results show a consistent performance increase as the model integrates memory-guided planning, reflection, and test-time learning mechanisms. The complete framework outperforms all intermediate variants and the baseline across every dataset column. Guiding the planner with memory yields significantly better results than using memory or planning in isolation. Training the planner model through reinforcement learning leads to substantial performance gains over a standard pre-trained planner. The introduction of online test-time learning provides the final substantial improvement, securing the highest accuracy across all benchmarks.
The table compares the proposed MIA model against various baselines on text-only out-of-domain datasets including SimpleQA, 2Wiki, HotpotQA, and GAIA. MIA demonstrates superior performance across all metrics, while traditional contextual memory methods often lag behind the no-memory baseline. The results validate the effectiveness of MIA's architecture in handling complex reasoning tasks without degrading performance. MIA consistently outperforms all baseline models including Memento and ExpeL across SimpleQA, 2Wiki, HotpotQA, and GAIA. Contextual memory methods such as RAG and Mem0 generally underperform the no-memory baseline, indicating potential noise introduction. The unsupervised variant of MIA achieves competitive results, closely approaching the performance of the supervised version.
The authors evaluate the proposed MIA framework against diverse baselines across multimodal and text-only out-of-domain datasets, demonstrating superior performance that rivals leading closed-source systems. Ablation studies confirm that integrating memory-guided planning, reinforcement learning, and test-time learning significantly enhances accuracy compared to isolated components or standard pre-trained planners. Notably, traditional contextual memory methods often underperform due to noise, whereas the unsupervised variant achieves competitive results through effective self-evolution.