
Unseen but Not Forgotten: Hybrid Memory for Dynamic Video World Models

Kaijin Chen Dingkang Liang Xin Zhou Yikang Ding Xiaoqiang Liu Pengfei Wan Xiang Bai

Abstract

Video world models have shown immense promise for simulating the physical world, yet existing memory mechanisms largely treat the environment as a static canvas. When dynamic subjects leave the field of view and later reappear, current methods often struggle: subjects freeze, distort, or vanish entirely. To address this challenge, we introduce a new paradigm called Hybrid Memory, which requires the model to act simultaneously as a faithful archivist for the static background and a vigilant tracker for dynamic subjects, guaranteeing motion continuity even while they are out of view. To foster research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It contains 59K high-fidelity clips with decoupled camera and subject trajectories, spanning 17 diverse scenes and 49 distinct subjects, with carefully designed exit-entry events for rigorously evaluating hybrid coherence. We further propose HyDRA, a dedicated memory architecture that compresses memory into tokens and leverages a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World show that our method substantially outperforms state-of-the-art approaches in both dynamic-subject consistency and overall generation quality.

One-sentence Summary

Researchers from Huazhong University of Science and Technology and Kling Team propose Hybrid Memory, a paradigm for video world models that maintains static backgrounds while tracking dynamic subjects. Their HyDRA architecture uses spatiotemporal retrieval to preserve motion consistency during out-of-view intervals, validated on the new HM-World dataset.

Key Contributions

  • The paper introduces Hybrid Memory, a novel paradigm that requires models to simultaneously maintain spatial consistency for static backgrounds and motion continuity for dynamic subjects during out-of-view intervals.
  • This work presents HM-World, the first large-scale video dataset dedicated to hybrid memory research, featuring 59K high-fidelity clips with decoupled camera and subject trajectories to rigorously evaluate spatiotemporal coherence.
  • A specialized memory architecture named HyDRA is proposed, which compresses memory into tokens and employs a spatiotemporal relevance-driven retrieval mechanism to effectively rediscover hidden subjects and preserve their identity and motion.

Introduction

Video world models are critical for applications like autonomous driving and embodied intelligence, yet current memory mechanisms treat environments as static canvases that fail when dynamic subjects move out of view. Existing approaches often cause hidden characters to vanish, freeze, or distort upon re-emergence because they lack the ability to track independent motion logic during occlusion. The authors introduce Hybrid Memory, a new paradigm that requires models to simultaneously archive static backgrounds and predict the unseen trajectories of dynamic subjects. To support this, they release HM-World, the first large-scale dataset featuring decoupled camera and subject movements, and propose HyDRA, a specialized architecture that uses spatiotemporal relevance-driven retrieval to preserve identity and motion continuity for hidden entities.

Dataset

  • Dataset Composition and Sources: The authors introduce HM-World, a large-scale synthetic dataset built to address the scarcity of natural videos featuring exit-entry events. It is generated entirely within Unreal Engine 5 by procedurally combining four core dimensions: 17 diverse 3D scenes, 49 distinct subjects (humans and animals), 10 predefined subject trajectories, and 28 designed camera trajectories.

  • Key Details for Each Subset: The final collection consists of 59,225 high-fidelity video clips. Each clip features 1 to 3 subjects moving along random paths while the camera executes deliberate back-and-forth motions to force subjects to leave and re-enter the frame. The dataset is unique in its inclusion of specific in-and-out-of-frame dynamics, unlike existing datasets that either lack dynamic subjects, keep subjects always visible, or use static cameras.

  • Usage in the Model: This dataset serves as a dedicated testing ground and training resource for Hybrid Memory in Video World Models. It enables the model to learn spatiotemporal decoupling by simultaneously anchoring static backgrounds and tracking dynamic subjects that disappear and reappear, a capability essential for maintaining visual identity and consistent motion states during out-of-view extrapolation.

  • Processing and Metadata Construction: The rendering pipeline filters out any clips that fail to produce exit-entry events. Every retained sample is comprehensively annotated with the rendered video, a descriptive caption generated by MiniCPM-V, precise camera poses, per-frame 3D positions for all subjects, and exact timestamps marking when each subject exits and enters the frame.

Method

The authors address the challenge of generating consistent video sequences where dynamic subjects frequently exit and re-enter the camera's field of view. As illustrated in the conceptual diagram, maintaining static, appearance, and motion consistency across time steps (T1 to T5) is critical when subjects are occluded or out of sight. To achieve high-fidelity future frame prediction, the model must preserve the static background while actively seeking moving subjects to maintain their appearance and motion consistency.

The overall framework is built upon a full-sequence video diffusion model; refer to the framework diagram for the complete pipeline structure. The architecture comprises a causal 3D VAE for spatiotemporal compression and a Diffusion Transformer (DiT) for generation. The model follows Flow Matching, with the diffusion timestep encoded via an MLP to modulate the DiT blocks. During training, the model learns to predict the ground-truth velocity $v_t = z_0 - z_1$ at timestep $t \in [0, 1]$, minimizing the loss function:

$$\mathcal{L}_{\theta} = \mathbb{E}_{z_0, z_1, t} \left\| u(z_t, t; \theta) - v_t \right\|^2$$
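The training objective above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's implementation: `model` stands in for the DiT velocity predictor $u(z_t, t; \theta)$, and the linear interpolation path from $z_1$ to $z_0$ (so that the velocity is $z_0 - z_1$) is an assumed convention consistent with the stated target.

```python
import torch

def flow_matching_loss(model, z0, z1):
    """Flow-matching loss sketch: the model predicts the constant velocity
    v_t = z0 - z1 along a straight path from z1 (t=0) to z0 (t=1).
    `model` is any callable u(z_t, t); the path convention is an assumption."""
    b = z0.shape[0]
    t = torch.rand(b, device=z0.device)              # timestep t ~ U[0, 1]
    t_ = t.view(b, *([1] * (z0.dim() - 1)))          # broadcast over latent dims
    z_t = (1 - t_) * z1 + t_ * z0                    # linear interpolation path
    v_target = z0 - z1                               # ground-truth velocity
    v_pred = model(z_t, t)
    return torch.mean((v_pred - v_target) ** 2)      # MSE over the batch
```

At inference time, integrating the predicted velocity field from $t = 0$ to $t = 1$ transports noise latents to clean video latents.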

To enable precise spatial control, camera trajectories are injected as an explicit condition. The camera pose sequence is flattened and encoded via a camera encoder, then added element-wise to the latent features fed into the DiT blocks.
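The camera-conditioning step described above can be illustrated with a small module. This is a hypothetical sketch: the pose dimensionality (a flattened 3x4 extrinsic, i.e. 12 values per frame), the MLP depth, and the tensor layout are all assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class CameraEncoder(nn.Module):
    """Toy camera-pose conditioner (assumed layout, not the paper's module).
    Each frame's pose is flattened, encoded with an MLP, and added
    element-wise to that frame's latent features before the DiT blocks."""
    def __init__(self, pose_dim: int = 12, latent_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(pose_dim, latent_dim),
            nn.SiLU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, latents: torch.Tensor, poses: torch.Tensor) -> torch.Tensor:
        # latents: (B, T, N, D) per-frame latent tokens; poses: (B, T, pose_dim)
        cam = self.mlp(poses)                 # (B, T, D) per-frame embedding
        return latents + cam.unsqueeze(2)     # broadcast over spatial tokens
```

Adding the pose embedding element-wise (rather than cross-attending to it) keeps the conditioning cheap and lets every spatial token of a frame see the same camera signal.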

To handle dynamic subjects efficiently without flooding the model with irrelevant noise, the authors introduce HyDRA (Hybrid Dynamic Retrieval Attention). This module replaces standard self-attention layers and consists of two key components: Memory Tokenization and Dynamic Retrieval Attention.

First, a Memory Tokenizer processes the encoded memory latents $Z_{mem}$. Instead of using raw latents, a 3D-convolution-based tokenizer expands the spatiotemporal receptive field to capture long-duration motion information, producing compact memory tokens $M$. This transformation is defined as $M = \mathcal{T}_{mem}(Z_{mem})$.
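A possible shape for $\mathcal{T}_{mem}$ is a stack of strided 3D convolutions, as in the sketch below. The layer count, channel widths, and stride pattern here are assumptions for illustration; the key point is that strided `Conv3d` layers jointly downsample time and space, enlarging the receptive field while shrinking the token count.

```python
import torch
import torch.nn as nn

class MemoryTokenizer(nn.Module):
    """Sketch of a 3D-conv memory tokenizer (layer sizes are assumptions).
    Strided Conv3d layers enlarge the spatiotemporal receptive field and
    compress the memory latents Z_mem into compact memory tokens M."""
    def __init__(self, in_ch: int = 16, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, dim, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv3d(dim, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, z_mem: torch.Tensor) -> torch.Tensor:
        # z_mem: (B, C, T, H, W) -> memory tokens M: (B, T'*H'*W', dim)
        x = self.net(z_mem)                  # two stride-2 stages: /4 per axis
        return x.flatten(2).transpose(1, 2)  # flatten spacetime into a sequence
```

Compared with attending over raw latents, this compression keeps the retrieval stage cheap: downstream attention only ever sees the reduced token sequence.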

Second, the Dynamic Retrieval Attention mechanism computes a spatiotemporal affinity metric between the target query and the memory tokens. As shown in the detailed module breakdown, the system performs a Top-K selection to retrieve the most relevant memory tokens based on affinity scores $S_{i,j}$. To preserve local denoising stability, the retrieved memory features are concatenated with the keys and values from a local temporal window. The final attention is computed using the standard formulation:

$$\mathrm{Attention}(q_i, K_i', V_i') = \mathrm{Softmax}\!\left( \frac{q_i (K_i')^{\top}}{\sqrt{d}} \right) V_i'$$

By iterating this process, the model selectively attends to pertinent motion and appearance cues of out-of-sight subjects, ensuring spatiotemporal consistency while reducing computational burden.
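The retrieve-then-attend pattern can be sketched as follows. This is a simplified, single-head illustration: a plain dot-product affinity pooled over queries stands in for the paper's spatiotemporal affinity metric $S_{i,j}$, and all tensor shapes are assumptions.

```python
import torch

def dynamic_retrieval_attention(q, local_k, local_v, mem_k, mem_v, top_k=4):
    """Simplified HyDRA-style retrieval attention (single head; dot-product
    affinity is a stand-in for the paper's spatiotemporal metric).
    q: (B, Nq, D); local_k/v: (B, Nl, D); mem_k/v: (B, Nm, D)."""
    d = q.shape[-1]
    # Affinity between queries and memory tokens, pooled over queries.
    affinity = (q @ mem_k.transpose(1, 2)).mean(dim=1)   # (B, Nm)
    idx = affinity.topk(top_k, dim=-1).indices           # Top-K memory tokens
    gather = idx.unsqueeze(-1).expand(-1, -1, d)         # (B, top_k, D)
    k_sel = mem_k.gather(1, gather)
    v_sel = mem_v.gather(1, gather)
    # Concatenate retrieved memory with the local temporal window (K', V').
    k_cat = torch.cat([local_k, k_sel], dim=1)
    v_cat = torch.cat([local_v, v_sel], dim=1)
    # Standard scaled dot-product attention over the augmented keys/values.
    attn = torch.softmax(q @ k_cat.transpose(1, 2) / d ** 0.5, dim=-1)
    return attn @ v_cat
```

Because only the Top-K memory tokens enter the attention, cost grows with `Nl + top_k` rather than the full memory length, which is what keeps long-horizon memory tractable.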

Experiment

  • Main experiments compare the proposed HyDRA method against baselines and state-of-the-art models, validating its superior ability to maintain subject identity and motion coherence during complex exit-and-re-entry events.
  • Qualitative results demonstrate that while competing methods suffer from subject distortion, vanishing, or stuttering, HyDRA successfully preserves hybrid consistency by effectively anchoring static backgrounds and tracking dynamic subjects.
  • Ablation studies confirm that temporal interaction within the memory tokenizer is critical for capturing long-term dynamics, as removing it causes significant consistency failures.
  • Experiments on token retrieval show that dynamic affinity-based selection outperforms static Field of View filtering by adaptively retrieving keyframes with rich subject details rather than relying on fixed geometric overlap.
  • Analysis of retrieved token counts indicates that a moderate number of tokens is sufficient to provide necessary spatiotemporal context, whereas overly restricted counts lead to information loss and generation artifacts.
  • Open-domain evaluations verify that the model generalizes well to unseen scenes and camera movements, maintaining robust memory capabilities without specific fine-tuning.
