HyperAIHyperAI

Command Palette

Search for a command to run...

MemDreamer: Entkopplung von Wahrnehmung und Schlussfolgerung für das Verständnis langer Videos mittels Hierarchischem Graphen-Gedächtnis und Agentic Retrieval Mechanismus

Cong Chen Guo Gan Kaixiang Ji ChaoYang Zhang Zhen Yang Guangming Yao Hao Chen Jingdong Chen Yi Yuan Chunhua Shen

Zusammenfassung

Aktuelle Vision-Language-Modelle haben Schwierigkeiten mit stundenlangen Videos, da die Verarbeitung vollständiger visueller Sequenzen zu einer prohibitiven token explosion und einer Aufmerksamkeitsverdünnung führt. Um dies zu bewältigen, führen wir MemDreamer ein, um Wahrnehmung und Schlussfolgerung zu entkoppeln und das Verständnis langer Videos in einen agentic exploration process zu überführen. Als Plug-and-Play-Framework streamt es Videos inkrementell, um ein Hierarchical Graph Memory zu konstruieren, eine top-down dreistufige Architektur für semantische Abstraktion, die von einem foundational graph verankert wird, der räumlich-zeitliche und kausale Beziehungen erfasst. Während der Inferenz setzt das Schlussfolgerungsmodell auf agentic tool-augmented retrieval, navigiert durch Hierarchien, durchsucht Knoten und traversiert logische Kanten mittels eines Observation-Reason-Action loops. Experimente zeigen, dass MemDreamer über vier gängige Benchmarks hinweg SOTA-Ergebnisse erzielt und die Lücke zu menschlichen Experten auf lediglich 3,7 Punkte verringert. Es beschränkt das reasoning context window auf lediglich 2 % der full-context ingestion und erzielt dabei einen absoluten Genauigkeitsgewinn von 12,5 Punkten. Darüber hinaus offenbart eine statistische Analyse eine starke positive lineare Korrelation zwischen der Leistung eines VLM's auf Benchmarks für logisches Schlussfolgern und dem Verständnis langer Videos, wodurch agentic capability scaling als neues Paradigma für multimodales Verständnis etabliert wird.

One-sentence Summary

MEMDREAMER decouples perception and reasoning for long video understanding by incrementally constructing a hierarchical graph memory and employing an agentic tool-augmented retrieval mechanism that constrains the reasoning context to 2% of full-context ingestion while delivering a 12.5-point absolute accuracy gain and state-of-the-art results across four benchmarks, thereby establishing agentic capability scaling as a new paradigm for multimodal comprehension.

Key Contributions

  • This paper introduces MEMDREAMER, a framework that decouples perception and reasoning to mitigate token explosion in long-video processing via a Hierarchical Graph Memory. The three-tier architecture progressively abstracts video semantics into a foundational graph that preserves long-range spatiotemporal and causal dependencies while suppressing irrelevant details.
  • The approach replaces passive context ingestion with a tool-augmented Agentic Retrieval Mechanism that operates through an iterative Observation-Reason-Action loop. By leveraging dedicated navigation, search, and traversal tools, the reasoning model actively explores the memory hierarchy to locate logically relevant information.
  • Experiments demonstrate state-of-the-art performance across four mainstream long-video benchmarks, narrowing the human expert gap to 3.7 points. The framework constrains the reasoning context window to 2% of full-length video data while delivering a 12.5 point absolute accuracy gain and establishing agentic capability scaling as a new paradigm for multimodal comprehension.

Introduction

The authors tackle long video understanding, a foundational capability required to scale Vision-Language Models toward embodied intelligence and complex real-world interaction. Existing methods typically couple visual perception and logical reasoning within a single context window, which triggers severe token explosion, attention dilution, and the "lost in the middle" phenomenon when processing hours-long footage. Even decoupled systems that store video data in flat or chunk-based memory banks struggle to preserve global context and temporal-causal relationships, often degrading retrieval into inefficient trial and error. To overcome these bottlenecks, the authors introduce MEMDREAMER, a framework that separates perception from reasoning by streaming video content into a three-tier Hierarchical Graph Memory. They then leverage a tool-augmented agentic retrieval mechanism that actively navigates this topological structure through an iterative observation-reason-action loop, enabling precise multi-step information extraction while consuming only a fraction of the standard context window.

Dataset

  • Dataset Composition and Sources: The authors organize raw video content into a structured, multi-granular graph memory. The hierarchy spans a root level that synthesizes the full video, mid-level Super Events that group related segments, and fine-grained Macro Events that capture specific actions or moments.

  • Subset Details and Clustering Rules: Super Events are formed by clustering adjacent Macro Events based on temporal continuity, shared scene or topic, and a common overarching goal. Boundaries are triggered whenever any of these conditions break. The clustering leverages signals like overlapping key entities, event type continuity, and OCR text cues such as scoreboard increments. Each Super Event typically contains three to eight Macro Events, though isolated singletons are permitted for self-contained segments.

  • Model Usage and Processing Pipeline: The authors deploy this graph as a proactive retrieval space for an agentic reasoning model rather than a conventional training corpus. The system operates within an Observation-Reason-Action loop, where the model sequentially queries the graph using a three-part tool bank for hierarchical navigation, precise semantic search, and local graph traversal. To prevent context dilution, the pipeline distills raw tool outputs into task-relevant evidence cues before appending them to the agent working memory.

  • Metadata Construction and Formatting: The authors enforce strict formatting guidelines to maintain consistency across the graph. Super Event labels are kept under ten words and include the central entity when applicable, while descriptions span one hundred to two hundred words. Key entities are deduplicated within each Super Event and canonicalized globally to collapse name variants into single identifiers. The root level metadata includes a concise title, a multi-sentence summary of all Super Events, three to five thematic tags, five to ten canonical entities, and two to three emotional tone descriptors.

Method

MEMDREAMER formulates long-video understanding as a decoupled process comprising two distinct stages: persistent memory construction and tool-augmented agentic retrieval. The framework begins by processing a video stream VVV through a perception model P\mathcal{P}P, which operates in a streaming fashion to construct a structured, purely textual Hierarchical Graph Memory, denoted as G\mathcal{G}G. This memory is a three-tiered, coarse-to-fine semantic topology designed to abstract and organize video content efficiently. The architecture is composed of a Video Root at the apex, representing the overall narrative, followed by a layer of Super Events that encapsulate broader narrative arcs, and a foundational layer of Macro Events that serve as the primary semantic anchors. Beneath each Macro Event, a local subgraph is constructed to capture fine-grained details.

The memory construction process unfolds in three phases. First, a streaming adaptive segmentation mechanism is employed to partition the video into semantically self-contained Macro Events. This approach avoids the arbitrary truncation common in fixed-length chunking by using semantic boundaries, ensuring each segment is coherent and maintaining the maximum input length to the perception model within a defined horizon τ\tauτ. This adaptive partitioning yields a sequence of Macro Events that form the leaf nodes of the hierarchy.

Second, for each segmented Macro Event, a downward subgraph extraction process constructs a local spatiotemporal subgraph gi=(Vi,Ei)g_i = (\mathcal{V}_i, \mathcal{E}_i)gi=(Vi,Ei). This subgraph is built using a novel Video-centric Graph Ontology where vertices consist of both standard entities (ViE\mathcal{V}_i^EViE) and micro-events (ViM\mathcal{V}_i^MViM). These vertices are interconnected via a heterogeneous edge set Ei\mathcal{E}_iEi that includes spatial-attribute edges for layout and physical properties, subject-object edges for action role binding, and directed temporal-causal edges to model the chronological and causal flow of micro-events. This design allows the model to capture the dynamic evolution and causal chains within an event, which is a limitation of simpler triplet-based graph representations.

Finally, an upward hierarchical aggregation process distills the detailed information from the Macro Event layer into a global navigation backbone. The textual descriptions of all Macro Events are fed into the perception model P\mathcal{P}P, which clusters and aggregates them based on temporal adjacency and semantic affinity. This bottom-up process forms Super Events and ultimately converges at a single Video Root node, creating a coarse-to-fine hierarchy that enables efficient global navigation and precise localization for retrieval.

Upon receiving a text query QQQ, a reasoning model R\mathcal{R}R equipped with a tool bank T\mathcal{T}T actively explores the pre-constructed hierarchical graph memory. The model operates exclusively on this textual memory, never accessing raw video data. The exploration is guided by an Observation-Reason-Action loop, where the reasoning model first observes the current state, reasons about the next necessary step, and then executes an action by calling one of the available tools. The toolkit is organized into three categories: navigation tools for traversing the hierarchy (e.g., getting summaries, listing events), search tools for finding specific nodes (e.g., semantic search, time-based search), and graph traversal tools for exploring local neighborhoods (e.g., retrieving relations between nodes). This agentic retrieval mechanism allows the model to iteratively refine its search, drill down into relevant subgraphs, and synthesize a concise set of task-relevant clues CCC to produce the final answer A=R(Q,C)A = \mathcal{R}(Q, C)A=R(Q,C).

Experiment

Evaluated across four diverse long-video benchmarks against vanilla vision-language models and memory-based baselines, the main experiments validate the broad superiority and plug-and-play compatibility of a decoupled architectural paradigm. Context reduction analyses further verify that segregating perception from reasoning effectively mitigates attention dilution and token redundancy compared to direct end-to-end ingestion. Finally, qualitative case studies confirm that explicit textual graph navigation successfully captures complex multi-step causal relationships that raw visual processing typically misses, establishing structured agentic reasoning as a robust alternative to brute-force context scaling.

The authors compare different reasoning models using various perception models on the LVBench benchmark, showing that the choice of perception model significantly impacts performance. Results indicate that Gemini-3.1-Pro consistently outperforms Gemini-2.5-Pro across all reasoning models, and the Qwen3-VL-235B-A22B-Thinking model achieves competitive results when paired with the stronger perception model. Gemini-3.1-Pro consistently achieves higher scores than Gemini-2.5-Pro across all reasoning models on LVBench. The Qwen3-VL-235B-A22B-Thinking model performs comparably to Gemini models when used with the Gemini-3.1-Pro perception model. Performance differences between reasoning models are more pronounced when paired with Gemini-2.5-Pro than with Gemini-3.1-Pro.

The authors compare their framework, MemDreamer, with end-to-end models across multiple benchmarks, demonstrating superior accuracy while using significantly shorter context lengths. Results show that MemDreamer achieves higher performance with much reduced input token requirements, indicating that decoupling perception from reasoning improves efficiency and effectiveness. The framework outperforms baseline models on LVBench across different reasoning engines, with notable gains in accuracy and context reduction. MemDreamer achieves higher accuracy than end-to-end models while using significantly shorter context lengths. The framework reduces context size by up to 43 times compared to full-video ingestion methods. MemDreamer maintains strong performance across different reasoning engines, showing its generality and effectiveness.

The authors evaluate their proposed MEMDREAMER framework on a long-video benchmark, comparing different memory architectures. Results show that combining hierarchical and graph-based memory components achieves the highest performance, demonstrating the effectiveness of their structured approach. The framework significantly outperforms other configurations, indicating that both memory organization and graph representation are crucial for long-video understanding. Combining hierarchical and graph memory components yields the best performance on the benchmark. The proposed framework outperforms configurations using only flat or hierarchical memory. The results validate the importance of structured memory design in long-video reasoning tasks.

The authors evaluate a framework that enhances long-video understanding by integrating search, graph traversal, and hierarchical navigation tools into a memory-based system. Results show that adding these components sequentially improves performance on a benchmark, with the full toolkit achieving the highest score, indicating that structured, tool-augmented reasoning outperforms direct full-context processing. Adding search capabilities improves performance over a baseline with no tools. Incorporating graph traversal further boosts performance beyond search alone. The full toolkit with hierarchical navigation achieves the highest score, demonstrating the effectiveness of a multi-step, structured reasoning approach.

The authors investigate the impact of varying maximum context lengths on performance across a benchmark, observing that performance improves with increased context length up to a point before stabilizing. Results show that increasing the context length leads to higher scores and more reasoning rounds, with the highest score achieved at a context length of 12, after which performance plateaus. The number of tokens per round also increases with context length, reflecting the expanded input size. Performance improves with increasing context length up to a threshold, after which it stabilizes. The highest score is achieved at a context length of 12, indicating an optimal balance between input size and reasoning effectiveness. Higher context lengths require more tokens per reasoning round, reflecting increased computational load.

The experiments evaluate the MEMDREAMER framework on long-video understanding benchmarks, demonstrating that decoupling perception from reasoning significantly improves accuracy while drastically reducing context requirements. Comparative and ablation studies validate that pairing stronger perception models with advanced reasoning architectures yields the greatest performance gains, while integrating hierarchical and graph-based memory structures alongside sequential tool-augmented reasoning substantially outperforms flat baselines and direct context processing. Finally, context length analysis reveals an optimal threshold beyond which performance stabilizes, confirming that structured, memory-driven reasoning effectively balances computational efficiency with sustained analytical quality.


KI mit KI entwickeln

Von der Idee bis zum Launch – beschleunigen Sie Ihre KI-Entwicklung mit kostenlosem KI-Co-Coding, sofort einsatzbereiter Umgebung und bestem GPU-Preis.

KI-gestütztes kollaboratives Programmieren
Sofort einsatzbereite GPUs
Die besten Preise

HyperAI Newsletters

Abonnieren Sie unsere neuesten Updates
Wir werden die neuesten Updates der Woche in Ihren Posteingang liefern um neun Uhr jeden Montagmorgen
Unterstützt von MailChimp