
Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, Wei Li
Abstract

We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot's perspective (M3-Bench-robot) and 929 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-Pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code, and data are available at https://github.com/bytedance-seed/m3-agent
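
To make the abstract's description concrete, the sketch below illustrates the two ideas it highlights: an entity-centric memory that mixes episodic and semantic entries from different modalities, and a multi-turn loop that alternates retrieval and reasoning until a question is answered. This is a minimal illustration only; all class and function names (MemoryEntry, EntityMemory, answer, toy_reasoner) are assumptions for exposition and do not reflect the actual M3-Agent implementation, which uses learned retrieval and a reinforcement-learning-trained reasoner.

```python
# Minimal sketch of an entity-centric long-term memory plus a multi-turn
# retrieve-and-reason loop, in the spirit of the abstract. Names are
# illustrative assumptions, not the M3-Agent API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class MemoryEntry:
    kind: str      # "episodic" (what happened) or "semantic" (world knowledge)
    entity: str    # entity the memory is attached to, e.g. "person_1"
    modality: str  # "visual" or "audio"
    content: str   # textual description of the observation or fact

@dataclass
class EntityMemory:
    # Memories are grouped by entity so retrieval stays consistent per entity.
    store: dict[str, list[MemoryEntry]] = field(default_factory=dict)

    def add(self, entry: MemoryEntry) -> None:
        self.store.setdefault(entry.entity, []).append(entry)

    def retrieve(self, query: str, top_k: int = 3) -> list[MemoryEntry]:
        # Toy keyword overlap; a real system would use learned embeddings.
        scored = [
            (sum(w in e.content.lower() for w in query.lower().split()), e)
            for entries in self.store.values() for e in entries
        ]
        scored.sort(key=lambda t: t[0], reverse=True)
        return [e for score, e in scored[:top_k] if score > 0]

def answer(memory: EntityMemory, question: str,
           reason: Callable[[str, list[MemoryEntry]], tuple[str, bool]],
           max_turns: int = 4) -> str:
    """Multi-turn loop: retrieve evidence, reason over it, stop when done."""
    query, result = question, "unknown"
    for _ in range(max_turns):
        evidence = memory.retrieve(query)
        result, done = reason(query, evidence)
        if done:
            break
        query = result  # the reasoner's follow-up query drives the next retrieval
    return result

if __name__ == "__main__":
    mem = EntityMemory()
    mem.add(MemoryEntry("episodic", "person_1", "visual",
                        "person_1 placed keys on the kitchen table"))
    mem.add(MemoryEntry("semantic", "person_1", "audio",
                        "person_1 said they prefer coffee in the morning"))

    def toy_reasoner(query, evidence):
        # Stand-in for the agent's reasoning model: return best evidence and stop.
        return (evidence[0].content if evidence else "no relevant memory"), True

    print(answer(mem, "where are the keys", toy_reasoner))
```

The entity keying is the design point worth noting: grouping memories by entity (rather than by time alone) is what lets later questions about a person or object retrieve a consistent set of episodic and semantic facts across modalities.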
