MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory
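Abstract
Long-term agent memory is becoming increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions could be answered from captions or textual traces alone, so answers could be inferred without preserving fine-grained visual evidence. Harder cases that require reasoning over changing visual states, meanwhile, are largely absent. We therefore introduce MemEye, a framework that evaluates memory capability along two dimensions. One dimension measures the granularity of the decisive visual evidence (from scene-level to pixel-level evidence), and the other measures how the retrieved evidence must be used (from single evidence to evolutionary synthesis). Under this framework, we construct a new benchmark spanning eight life-scenario tasks, with ablation-driven validation gates that assess answerability, shortcut resistance, visual necessity, and reasoning structure. Evaluating 13 memory methods across 4 VLM backbones shows that current architectures still struggle to preserve fine-grained visual details and to reason about state changes over time. Our findings show that long-term multimodal memory depends on evidence routing, temporal tracking, and detail extraction.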
One-sentence Summary
MemEye introduces a visual-centric evaluation framework that benchmarks multimodal agent memory across eight life-scenario tasks by measuring evidence granularity and retrieval synthesis, revealing through the evaluation of thirteen memory methods on four VLM backbones that current architectures struggle to preserve fine-grained visual details or reason over temporal state changes, thereby highlighting the necessity for robust evidence routing and tracking.
Key Contributions
- Introduces MemEye, a two-dimensional evaluation framework that assesses long-term multimodal agent memory by measuring visual evidence granularity and the complexity of synthesized reasoning.
- Constructs a benchmark across eight life-scenario tasks validated through ablation-driven gates and caption-substitution diagnostics to enforce visual necessity and prevent textual shortcut exploitation.
- Evaluates thirteen memory methods across four VLM backbones to demonstrate that existing architectures struggle to preserve fine-grained visual details over time, establishing that effective long-term memory relies on evidence routing and temporal tracking.
Introduction
Vision-Language Models are increasingly relied on for long-term agent memory, which is essential for enabling AI systems to handle complex real-world tasks that require retaining both dialogue history and visual context. Prior evaluation frameworks, however, largely overlook this visual dimension: they rely on text-heavy benchmarks or allow models to answer questions from captions rather than the original images. This design flaw masks critical failures in preserving fine-grained visual details and tracking temporal state changes across sessions. To address these gaps, the authors introduce MemEye, an evaluation framework and benchmark that measures multimodal memory along two orthogonal axes: visual evidence granularity and memory reasoning depth. By constructing a rigorously validated dataset of 371 questions across eight life-scenario tasks, they expose fundamental trade-offs in current architectures and demonstrate that reliable long-term visual memory requires precise evidence routing, temporal tracking, and detail extraction.
Dataset
Dataset Overview: The authors present MemEye, a vision-centric benchmark designed to evaluate long-horizon multimodal agent memory. The dataset comprises 371 questions distributed across 221 sessions, 848 dialogue rounds, and 438 images. Each question is provided in mirrored multiple-choice and open-ended formats to support diverse evaluation setups.
Composition and Sources: The benchmark spans eight tasks grouped into four life-scenario domains: Leisure, Domestic, Professional, and Personal. Image provenance varies by task and includes archival, public, and generated media. Specific sources include the Pitt Image Ads dataset for Brand Memory, public-domain comic strips and Seed-Story for Cartoon Entertainment, stock interior-design photographs for Home Renovation, HTML-rendered screenshots based on Cardiverse for Card Playlog, StyleGAN faces combined with PIL-rendered UI for Social Chat, dashcam frames from the Japan Open Driving Dataset for Outdoor Navigation, and AI-generated images via DALL-E for CrossScene Memory and Health Care. The collection covers diverse image types such as photographs, screenshots, comic panels, and interface renderings.
Taxonomy and Metadata: Every question is assigned an (X, Y) coordinate using a highest-bottleneck rule that captures the finest required visual evidence and the deepest memory operation. The X-axis measures visual evidence granularity with four levels: X1 for scene-level gist, X2 for region-level spatial details, X3 for instance-level identification, and X4 for pixel-level attributes like exact color, text, or texture. The Y-axis assesses memory reasoning depth with three levels: Y1 for atomic retrieval of a single fact, Y2 for relational association of non-conflicting distributed clues, and Y3 for evolutionary synthesis involving updates, conflicts, or state overrides. The dataset includes metadata on taxonomy distribution and marks each task by whether visual evidence is archival or generated.
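The highest-bottleneck rule can be read as taking a maximum over each axis. The following is a minimal sketch under that reading, not the authors' annotation code: the enum names, `assign_coordinate`, and `format_coordinate` are hypothetical, and the per-question lists of required levels are assumed to come from an annotator.

```python
from enum import IntEnum
from typing import Iterable, Tuple


class VisualGranularity(IntEnum):
    SCENE = 1     # X1: scene-level gist
    REGION = 2    # X2: region-level spatial details
    INSTANCE = 3  # X3: instance-level identification
    PIXEL = 4     # X4: exact color, text, or texture


class ReasoningDepth(IntEnum):
    ATOMIC = 1        # Y1: retrieval of a single fact
    RELATIONAL = 2    # Y2: association of non-conflicting distributed clues
    EVOLUTIONARY = 3  # Y3: updates, conflicts, or state overrides


def assign_coordinate(
    visual_needs: Iterable[VisualGranularity],
    memory_ops: Iterable[ReasoningDepth],
) -> Tuple[VisualGranularity, ReasoningDepth]:
    """Label a question by its finest visual need and deepest memory operation."""
    visual_needs = list(visual_needs)
    memory_ops = list(memory_ops)
    if not visual_needs or not memory_ops:
        raise ValueError("each axis needs at least one annotated requirement")
    return max(visual_needs), max(memory_ops)


def format_coordinate(coord: Tuple[VisualGranularity, ReasoningDepth]) -> str:
    """Render a coordinate in the (X, Y) notation used in the text."""
    x, y = coord
    return f"(X{int(x)}, Y{int(y)})"


# Hypothetical question: needs the scene gist plus an exact color, and must
# link clues from two sessions without any state update.
example = assign_coordinate(
    [VisualGranularity.SCENE, VisualGranularity.PIXEL],
    [ReasoningDepth.ATOMIC, ReasoningDepth.RELATIONAL],
)
example_label = format_coordinate(example)
```

Under this sketch, a question that needs any pixel-level attribute is labeled X4 even if most of its evidence is coarse, which is what "bottleneck" suggests.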
Processing and Filtering: The authors apply three rigorous filtering mechanisms to ensure the benchmark tests visual memory rather than text solvability or foundation model recognition. First, they eliminate answer leakage by testing questions against dialogue text alone and removing items solvable without visual evidence. Second, they test for visual bypassability by replacing images with minimal captions and discarding questions that remain answerable with text descriptions. Third, they control for inherent difficulty by providing images with answer-relevant context to isolate answerability from memory constraints, removing items that fail due to base model limitations. The authors also mitigate multiple-choice bias by creating four rotated variants for each question where the correct answer cycles through all options.
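The rotation step for multiple-choice bias can be sketched as a cyclic shift of the option list. This is an illustrative sketch rather than the benchmark's own code: `rotated_variants` and the sample options are hypothetical, and four options per question are assumed.

```python
from typing import List, Sequence, Tuple


def rotated_variants(
    options: Sequence[str], correct_index: int
) -> List[Tuple[List[str], int]]:
    """Return one variant per cyclic shift, with the updated correct index."""
    n = len(options)
    variants = []
    for shift in range(n):
        # The option at position i moves to position (i + shift) % n.
        rotated = [options[(i - shift) % n] for i in range(n)]
        variants.append((rotated, (correct_index + shift) % n))
    return variants


# Hypothetical four-option question whose correct answer is "green".
options = ["red", "green", "blue", "yellow"]
variants = rotated_variants(options, correct_index=1)
correct_positions = [index for _, index in variants]
```

Because every option order is a rotation of the original, the relative order of the distractors is preserved while the correct answer occupies each position exactly once.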
Usage and Evaluation: The dataset is structured for evaluation and includes prompt templates for multimodal multiple-choice answering, text-plus-caption answering, question generation, and taxonomy annotation. Evaluation employs an LLM-as-a-judge framework with a detailed rubric that scores responses on a scale from 0 to 1 based on semantic equivalence, handling of negations, identity matching, and avoidance of hallucinations. The authors provide JSON output formats for question generation and annotation to standardize metadata construction and ensure consistent labeling of visual and reasoning requirements.
Method
The authors leverage a two-dimensional evaluation framework to structure MemEye, organizing tasks along a coordinate system defined by visual perception granularity and reasoning depth. The X-axis, referred to as visual granularity, spans four levels from coarse to fine: scene-level (X1), region-level (X2), instance-level (X3), and pixel-level (X4). These levels correspond to the scale at which visual evidence must be processed, ranging from global scene semantics to fine-grained pixel details such as color and texture. The Y-axis represents reasoning complexity, capturing the depth of cognitive processing required to retrieve and synthesize evidence. It is structured into three levels: Y1 (Atomic Retrieval), where a single evidence unit suffices; Y2 (Relational Association), which requires combining multiple non-redundant evidence units; and Y3 (Evolutionary Synthesis), where temporal ordering and state updates across clues are necessary to resolve the answer.
The framework is implemented through a multi-stage pipeline that begins with task and visual evidence generation, followed by the construction of multi-session dialogues and the creation of multiple-choice or open-ended questions. Each question is annotated with its corresponding (X,Y) granularity and complexity levels. The process then enters a rigorous filtering phase, starting with a clue sufficiency check to ensure the question can be answered using ground-truth evidence. This is followed by option-bias rejection, where answer positions are rotated to prevent response bias. A text-leakage filter checks whether the answer can be inferred solely from the text, while a bypass filter evaluates whether a short caption can replace the full image without compromising the question's validity. The difficulty calibration stage ensures that the question maintains appropriate challenge levels across the taxonomy.
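Three of these gates (clue sufficiency, text leakage, and caption bypass) can be sketched as ordered pass/fail checks. The sketch below rests on loud assumptions: `Item`, `run_gates`, and `toy_solver` are hypothetical names, the evidence is a text stand-in for an image, and the keyword-matching solver replaces the VLM a real pipeline would call. Option-bias rejection and difficulty calibration are omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Item:
    question: str
    answer: str
    dialogue_text: str
    caption: str
    evidence: str  # text stand-in for the ground-truth image evidence


Solver = Callable[[str, str], str]  # (question, context) -> predicted answer


def run_gates(items: List[Item], solver: Solver) -> Tuple[List[Item], Dict[str, str]]:
    """Keep items that pass every gate; record the first gate each reject fails."""
    gates = [
        # Answerable when the ground-truth evidence is provided.
        ("clue_sufficiency", lambda it: solver(it.question, it.evidence) == it.answer),
        # Not answerable from the dialogue text alone.
        ("text_leakage", lambda it: solver(it.question, it.dialogue_text) != it.answer),
        # Not answerable when a short caption replaces the image.
        ("caption_bypass",
         lambda it: solver(it.question, it.dialogue_text + " " + it.caption) != it.answer),
    ]
    kept: List[Item] = []
    rejected: Dict[str, str] = {}
    for item in items:
        failed = next((name for name, passes in gates if not passes(item)), None)
        if failed is None:
            kept.append(item)
        else:
            rejected[item.question] = failed
    return kept, rejected


def toy_solver(question: str, context: str) -> str:
    """Hypothetical solver: returns the first known color word in the context."""
    for word in ("red", "blue", "green"):
        if word in context:
            return word
    return "unknown"


items = [
    Item("What color was the mug?", "blue", "I bought a mug today.",
         "a mug on a desk", "image: blue mug"),
    Item("What color is the bike?", "red", "I love my red bike.",
         "a bike", "image: red bike"),
    Item("What color is the lamp?", "green", "Here is my new lamp.",
         "a green lamp", "image: green lamp"),
    Item("What color was the car?", "blue", "Look at this photo.",
         "a car", "image: blurry photo"),
]
kept, rejected = run_gates(items, toy_solver)
```

In this toy run only the mug question survives: the bike question leaks its answer in the dialogue, the lamp question is solvable from the caption, and the car question is unanswerable even with its evidence.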
A reasoning-structure audit validates that the question adheres to the intended Y-level evidence structure, ensuring that Y1 items require only atomic retrieval, Y2 items involve relational association across multiple clues, and Y3 items necessitate evolutionary synthesis over time. The pipeline concludes with diagnostic evaluations: a caption-proof diagnostic tests whether captions can substitute visual evidence across X levels, and an oracle-evidence diagnostic verifies that the Y-axis reasoning is dependent on multimodal evidence by testing performance under oracle conditions. The final output is a validated MemEye benchmark characterized by high-quality, shortcut-resistant, well-specified, and diverse items ready for evaluation.
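One way to read the caption-proof diagnostic is as an accuracy gap between image-based and caption-based answering, grouped by X level. The sketch below assumes that reading; `caption_gap_by_level` is a hypothetical helper, and the records are made-up toy values, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def caption_gap_by_level(
    records: Iterable[Tuple[int, bool, bool]]
) -> Dict[int, float]:
    """Map each X level to (accuracy with images) minus (accuracy with captions).

    Each record is (x_level, correct_with_image, correct_with_caption).
    """
    totals = defaultdict(lambda: [0, 0, 0])  # [count, image hits, caption hits]
    for level, image_ok, caption_ok in records:
        entry = totals[level]
        entry[0] += 1
        entry[1] += int(image_ok)
        entry[2] += int(caption_ok)
    return {
        level: (image_hits - caption_hits) / count
        for level, (count, image_hits, caption_hits) in sorted(totals.items())
    }


# Toy records for illustration only.
toy_records = [
    (1, True, True), (1, True, True),
    (4, True, False), (4, True, False), (4, False, False), (4, True, True),
]
gaps = caption_gap_by_level(toy_records)
```

A larger gap at a given level means captions substitute poorly for images there, which is the behavior a caption-proof benchmark would want at fine granularity.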
Experiment
The study evaluates thirteen memory architectures across four vision-language models using the MemEye benchmark, which organizes tasks along a visual granularity axis and a reasoning depth axis, validated through caption-proof and oracle-evidence diagnostics. Results reveal that current systems struggle with two primary bottlenecks: converting images to text causes significant loss of fine-grained visual details, while retrieval-based approaches frequently select temporally stale evidence when tracking evolving visual states. Consequently, effective long-term multimodal memory requires a hybrid approach that preserves native visual evidence alongside structured textual state records, supplemented by filtering mechanisms that prioritize temporally valid information over semantic similarity.
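The stale-evidence failure and the suggested fix can be illustrated with a small selection rule. This is a sketch under stated assumptions, not a method from the paper: `Evidence`, `select_by_similarity`, `select_latest_valid`, the similarity threshold, and the sofa example are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Evidence:
    entity: str
    state: str
    session: int       # larger means more recent
    similarity: float  # retrieval score against the query


def select_by_similarity(candidates: List[Evidence]) -> Evidence:
    """Baseline: trust the most semantically similar memory, whatever its age."""
    return max(candidates, key=lambda e: e.similarity)


def select_latest_valid(
    candidates: List[Evidence], entity: str, min_similarity: float = 0.3
) -> Optional[Evidence]:
    """Keep relevant memories of the entity, then prefer the most recent one."""
    relevant = [
        e for e in candidates
        if e.entity == entity and e.similarity >= min_similarity
    ]
    if not relevant:
        return None
    # Recency decides first; similarity only breaks ties within a session.
    return max(relevant, key=lambda e: (e.session, e.similarity))


# Hypothetical history: the sofa was grey in session 1 and replaced by a blue
# one in session 5, but the older memory matches the query wording better.
history = [
    Evidence("sofa", "grey", session=1, similarity=0.9),
    Evidence("sofa", "blue", session=5, similarity=0.6),
    Evidence("lamp", "brass", session=6, similarity=0.2),
]
stale = select_by_similarity(history)
current = select_latest_valid(history, entity="sofa")
```

Here the similarity-only baseline returns the outdated grey sofa, while the recency-aware rule returns the blue one. The entity match and threshold stand in for whatever validity check a real system would use.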
The authors analyze the performance of various memory systems on a benchmark that evaluates both reasoning depth and visual granularity. Systems perform differently across the two dimensions: multimodal methods generally outperform text-based ones in fine-grained visual tasks but struggle with evolving state selection, while text-based methods are more competitive in reasoning over evolving states, where the ability to track updates and conflicts is crucial. The benchmark reveals a trade-off between preserving visual evidence and selecting valid states in dynamic memory environments, and no single method excels across both dimensions.
The authors analyze memory system performance across levels of visual evidence granularity and reasoning depth, and find that current systems struggle with fine-grained visual information and evolving visual states. Multimodal memory systems outperform text-based ones at preserving fine-grained visual evidence, especially at higher visual granularity levels. Retrieval-based methods often fail to select the most recent valid visual state when evidence changes over time, even when the relevant evidence is retrieved. Text-based memory systems handle evolving visual states better by maintaining structured state records, but they lose fine-grained visual details.
The authors analyze the performance of various memory systems on a benchmark that evaluates both visual evidence granularity and reasoning depth. Multimodal methods generally outperform text-based methods, especially on tasks requiring fine-grained visual details. However, performance declines for all systems on tasks that require tracking evolving visual states over time, indicating a bottleneck in selecting valid evidence from a dynamic memory history. No method fully addresses both visual evidence preservation and evolving state selection simultaneously.
The experiment evaluates multiple memory systems across a two-axis framework that measures visual evidence granularity and reasoning depth over memory. Current systems fail to handle fine-grained visual details and evolving visual states simultaneously, and they show distinct failure modes, related to visual information loss and state selection, in different parts of the evaluation matrix. Multimodal methods outperform text-based methods on fine-grained visual tasks, while text-based methods perform better on evolving-state reasoning, indicating a trade-off between visual fidelity and state tracking. The results suggest that future memory systems need to combine image-based and text-based memory with mechanisms for filtering and selecting valid evidence from long, diverse histories.
The authors evaluate multiple memory systems across a two-dimensional benchmark that assesses visual evidence granularity and reasoning depth over memory. No single method performs well across all conditions: multimodal methods excel in fine-grained visual tasks but struggle with evolving state reasoning, while text-based methods reason more effectively over evolving visual states but lose fine-grained visual details. Current systems struggle both to preserve detailed visual information and to select the most relevant updated evidence from long histories, indicating a need for combined memory architectures.
The experiments evaluate multiple memory systems using a two-dimensional benchmark that validates performance across visual evidence granularity and reasoning depth over dynamic memory. The analysis reveals a distinct qualitative trade-off, with multimodal architectures excelling at preserving fine-grained visual details while text-based methods prove more effective at tracking evolving states. Ultimately, no single approach successfully balances both requirements, indicating that future systems must integrate visual and textual memory with dedicated mechanisms for filtering valid evidence across temporal histories.