Memory-V2V: Augmenting Video-to-Video Diffusion Models with Memory

Dohun Lee Chun-Hao Paul Huang Xuelin Chen Jong Chul Ye Duygu Ceylan Hyeonho Jeong

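Abstract

Recent foundational video-to-video diffusion models have achieved impressive results in video editing by modifying the appearance, motion, or camera movement of user-provided videos. In practice, however, video editing is often an iterative process in which users refine results gradually over multiple rounds of interaction. In this multi-turn setting, current video editors struggle to maintain cross-consistency across successive edits. In this work, we address the problem of cross-consistency in multi-turn video editing for the first time and propose Memory-V2V, a simple yet effective framework that augments existing video-to-video models with explicit memory. Given an external cache of previously edited videos, Memory-V2V uses accurate retrieval and dynamic tokenization strategies to condition the current editing step on prior results. To further reduce redundancy and computational overhead, we introduce a learnable token compressor inside the DiT backbone that compresses redundant conditioning tokens while preserving essential visual cues, yielding an overall speedup of about 30%. We validate Memory-V2V on challenging tasks such as video novel view synthesis and text-conditioned long video editing. Extensive experiments show that Memory-V2V produces videos with substantially higher cross-consistency at minimal computational overhead, while maintaining or improving task-specific performance relative to state-of-the-art baselines. Project page: https://dohunlee1.github.io/MemoryV2V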

One-sentence Summary

Adobe Research and KAIST researchers propose Memory-V2V, a framework enhancing video-to-video diffusion models with explicit memory for iterative editing, using dynamic tokenization and a learnable compressor to maintain cross-consistency across edits while accelerating inference by 30% for tasks like novel view synthesis and long video editing.

Key Contributions

  • Memory-V2V introduces the first framework for cross-consistent multi-turn video editing, addressing the real-world need for iterative refinement by augmenting diffusion models with explicit visual memory to preserve consistency across sequential edits.
  • The method employs task-specific retrieval, dynamic tokenization, and a learnable token compressor within the DiT backbone to condition current edits on prior results while reducing computational overhead by 30% through adaptive compression of less relevant tokens.
  • Evaluated on video novel view synthesis and text-guided long video editing, Memory-V2V outperforms state-of-the-art baselines in cross-iteration consistency and maintains or improves task-specific quality with minimal added cost.

Introduction

The authors leverage existing video-to-video diffusion models to address the real-world need for iterative video editing, where users refine outputs over multiple interactions. Prior methods fail to maintain cross-consistency across edits—whether synthesizing novel views or editing long videos—because they lack mechanisms to recall and align with previous generations. Memory-V2V introduces explicit visual memory by retrieving relevant past edits from an external cache, dynamically tokenizing them based on relevance, and compressing redundant tokens within the DiT backbone to reduce computation by 30% while preserving visual fidelity. This enables consistent, multi-turn editing without sacrificing speed or quality, advancing toward practical, memory-aware video tools.

Dataset

  • The authors construct a long-form video editing dataset by extending short clips from Señorita-2M, which provides 33-frame editing pairs with stable local edits. Each clip is extended by 200 frames using FramePack, yielding 233-frame sequences for training.

  • The extended 200 frames serve as memory during training. At each iteration, the model randomly samples segments from this extended portion to condition on past context, enabling long-horizon editing.

  • For positional encoding, they use RoPE with hierarchical temporal indexing: target frames get indices 0 to T−1, the immediately preceding segment gets T to 2T−1, and remaining memory segments get 2T to 3T−1. This preserves continuity with recent context while incorporating broader history.

  • To close the training-inference gap, they reverse the RoPE index order of the memory frames at inference time. Memory frames are indexed chronologically during training and in reverse chronological order during inference, and the reversal keeps the positional structure the model sees at inference aligned with the one it was trained on.

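  • The dataset supports training Memory-V2V with long memory contexts. The accompanying efficiency analysis reports that dynamic tokenization reduces FLOPs and latency by over 90% relative to uniform tokenization while scaling gracefully with the number of memory videos.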
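
The hierarchical temporal indexing described in the list above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `hierarchical_rope_indices`, the assumption that each memory group holds exactly T frames, and the reading of the inference-time reversal as "flip each memory range" are all assumptions made here for illustration.

```python
def hierarchical_rope_indices(T, reverse_memory=False):
    """Return temporal RoPE indices for target, recent, and older memory frames."""
    target = list(range(0, T))          # frames being generated: 0 .. T-1
    recent = list(range(T, 2 * T))      # immediately preceding segment: T .. 2T-1
    older = list(range(2 * T, 3 * T))   # remaining memory segments: 2T .. 3T-1
    if reverse_memory:
        # Assumed reading of the inference-time reversal: flip each memory range.
        recent.reverse()
        older.reverse()
    return {"target": target, "recent": recent, "older": older}


train = hierarchical_rope_indices(4)
infer = hierarchical_rope_indices(4, reverse_memory=True)
print(train)
print(infer)
```

The three groups always occupy disjoint index ranges, so the target frames keep the same positions regardless of how the memory frames are ordered.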

Method

The Memory-V2V framework is designed to enable multi-turn video editing by maintaining cross-edit consistency through a hybrid retrieval and compression strategy. The architecture builds upon a pretrained video-to-video diffusion model, such as ReCamMaster for novel view synthesis or LucyEdit for text-guided editing, and introduces mechanisms to efficiently incorporate prior editing history. At each editing iteration, the model generates a new video conditioned on the current input and a set of relevant past videos retrieved from an external cache. This cache stores the latent representations of previously generated videos, indexed by their camera trajectories for novel view synthesis or by their source video segments for text-guided editing. The retrieval process ensures that only the most relevant historical videos are considered, mitigating the computational burden of conditioning on an ever-growing history.

The core of the framework lies in its dynamic tokenization and adaptive token merging components. For video novel view synthesis, relevance is determined by a VideoFOV retrieval algorithm that quantifies the geometric overlap between the field-of-view (FOV) of the target camera trajectory and those of the cached videos. This is achieved by sampling points on a unit sphere centered at the first camera position and determining visibility within the projected image bounds for each frame. The video-level FOV is the union of all frame-level FOVs, and two complementary similarity metrics, overlap and containment, are used to compute a final relevance score. The top-k most relevant videos are retrieved and dynamically tokenized using learnable tokenizers with varying spatio-temporal compression factors. Specifically, the user-input video is tokenized with a 1×2×2 kernel, the top-3 most relevant retrieved videos with a 1×4×4 kernel, and the remaining videos with a 1×8×8 kernel. This adaptive tokenization strategy allocates the token budget efficiently, preserving fine-grained details for the most relevant videos while managing the total token count.
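
The FOV-based retrieval described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' VideoFOV implementation: cameras are reduced to a position plus a yaw angle, the pinhole parameters (`focal`, `half_w`, `half_h`) are made up, overlap is taken to be intersection-over-union, containment is the fraction of the target FOV covered by the cached video, and the final score is assumed to be their mean. All function names are hypothetical.

```python
import math


def fibonacci_sphere(n, center=(0.0, 0.0, 0.0)):
    """Spread n points roughly uniformly on a unit sphere around `center`."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        y = 1.0 - 2.0 * (i + 0.5) / n
        r = math.sqrt(max(0.0, 1.0 - y * y))
        theta = golden_angle * i
        points.append((center[0] + r * math.cos(theta),
                       center[1] + y,
                       center[2] + r * math.sin(theta)))
    return points


def is_visible(point, cam_pos, yaw, focal=1.0, half_w=1.0, half_h=0.5625):
    """Pinhole visibility test for a camera that only rotates about the y axis."""
    dx, dy, dz = (point[k] - cam_pos[k] for k in range(3))
    c, s = math.cos(yaw), math.sin(yaw)
    x_cam = c * dx - s * dz   # offset along the camera's right vector
    z_cam = s * dx + c * dz   # offset along the camera's forward vector
    if z_cam <= 1e-6:         # point lies behind the camera
        return False
    u, v = focal * x_cam / z_cam, focal * dy / z_cam
    return abs(u) <= half_w and abs(v) <= half_h


def video_fov(trajectory, points):
    """Union of per-frame visible point indices; trajectory = [(pos, yaw), ...]."""
    seen = set()
    for cam_pos, yaw in trajectory:
        seen.update(i for i, p in enumerate(points) if is_visible(p, cam_pos, yaw))
    return seen


def relevance(target_fov, cached_fov):
    """Assumed score: mean of overlap (IoU) and containment of the target FOV."""
    inter = len(target_fov & cached_fov)
    union = len(target_fov | cached_fov)
    overlap = inter / union if union else 0.0
    containment = inter / len(target_fov) if target_fov else 0.0
    return 0.5 * (overlap + containment)


def retrieve_top_k(target_traj, cache, points, k=3):
    """Return the names of the k cached trajectories most relevant to the target."""
    target_fov = video_fov(target_traj, points)
    scored = [(relevance(target_fov, video_fov(traj, points)), name)
              for name, traj in cache.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


origin = (0.0, 0.0, 0.0)
points = fibonacci_sphere(2000, center=origin)
target = [(origin, 0.0), (origin, 0.2)]
cache = {
    "pan_right": [(origin, 0.1), (origin, 0.3)],
    "look_back": [(origin, math.pi), (origin, math.pi + 0.2)],
}
print(retrieve_top_k(target, cache, points, k=1))
```

In this toy setup the cached video that pans in nearly the same direction as the target ranks first, while the one facing the opposite way shares no visible sphere points with the target and scores zero.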

To further enhance computational efficiency, the framework employs adaptive token merging. This strategy leverages the observation that DiT attention maps are inherently sparse, with only a small subset of tokens contributing meaningfully to the output. The responsiveness of each frame is estimated by computing its maximum attention response to the target queries. Frames with low responsiveness scores are identified as containing redundant information and are merged using a learnable convolutional operator. The merging is applied at specific points in the DiT architecture—Block 10 and Block 20—where responsiveness scores have stabilized, ensuring that essential context is preserved while reducing redundancy. This approach avoids the degradation that would result from completely discarding low-importance tokens.
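
A rough sketch of the responsiveness-then-merge idea follows. It is not the authors' operator: the fixed stride-2 averaging stands in for the learnable convolution, the threshold value is hypothetical, and the attention matrix is a toy example in which `attn[q][k]` is the weight from target query `q` to memory token `k`, with tokens stored frame by frame.

```python
def frame_responsiveness(attn, tokens_per_frame):
    """Score each memory frame by its maximum attention weight over all target queries."""
    num_tokens = len(attn[0])
    scores = []
    for start in range(0, num_tokens, tokens_per_frame):
        stop = min(start + tokens_per_frame, num_tokens)
        scores.append(max(row[k] for row in attn for k in range(start, stop)))
    return scores


def merge_low_response_frames(tokens, scores, tokens_per_frame, threshold):
    """Keep responsive frames intact; halve the token count of the others."""
    merged = []
    for f, score in enumerate(scores):
        frame = tokens[f * tokens_per_frame:(f + 1) * tokens_per_frame]
        if score >= threshold:
            merged.extend(frame)
            continue
        # Stand-in for the learnable convolution: average adjacent token pairs.
        for i in range(0, len(frame), 2):
            pair = frame[i:i + 2]
            merged.append([sum(vals) / len(pair) for vals in zip(*pair)])
    return merged


# Two target queries attending to two memory frames of two tokens each.
attn = [[0.60, 0.30, 0.05, 0.05],
        [0.50, 0.40, 0.02, 0.08]]
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 4.0]]

scores = frame_responsiveness(attn, tokens_per_frame=2)
compact = merge_low_response_frames(tokens, scores, tokens_per_frame=2, threshold=0.1)
print(scores)
print(compact)
```

Merging keeps a summary of the low-responsiveness frame in the sequence, which is the property the text contrasts with discarding those tokens outright.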

For text-guided long video editing, the framework reformulates the task as an iterative, multi-turn process. A long input video is divided into shorter segments that fit within the base model's temporal context. While editing each segment, the model retrieves the most relevant previously edited segments from the cache, ranking them by the similarity between the DINOv2 embeddings of their source video segments and that of the current source segment. The retrieved videos are then dynamically tokenized and processed with adaptive token merging. The edited segments are stitched together to form the final output video. This approach ensures consistency across the entire long-form video, as demonstrated by the ability to consistently add the same object or transform a specific element across all segments.
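
The segment retrieval step can be illustrated with a small similarity search. This sketch assumes cosine similarity as the measure and uses short hand-written vectors in place of real DINOv2 embeddings; the cache layout (`source_emb`, `segment_id`) and function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_segments(query_emb, cache, k=3):
    """Return the k cache entries whose source embeddings best match the query."""
    ranked = sorted(cache,
                    key=lambda entry: cosine(query_emb, entry["source_emb"]),
                    reverse=True)
    return ranked[:k]


# Each entry pairs a source-segment embedding with the id of its edited result.
cache = [
    {"segment_id": 0, "source_emb": [1.0, 0.1]},
    {"segment_id": 1, "source_emb": [0.0, 1.0]},
    {"segment_id": 2, "source_emb": [0.9, 0.5]},
]
current_source_emb = [1.0, 0.0]

best = retrieve_segments(current_source_emb, cache, k=2)
print([entry["segment_id"] for entry in best])
```

Matching on the source segments, and then conditioning on the corresponding edited results, lets the current segment inherit edits that were applied to visually similar content.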

The framework also addresses the challenge of positional encoding during multi-turn editing. To prevent temporal drift and inconsistencies when generating videos longer than the training horizon or incorporating expanding conditioning sets, a hierarchical RoPE (Rotary Position Embedding) design is employed. The target, user-input, and memory videos are assigned disjoint ranges of temporal RoPE indices. A mixed training strategy, including Gaussian noise perturbation for memory tokens and RoPE dropout for user-input tokens, is used to ensure the model can correctly interpret and utilize this hierarchical structure during inference. Additionally, camera conditioning is made explicit by embedding camera trajectories on a per-video basis, allowing the model to handle heterogeneous viewpoints and improve viewpoint reasoning.

Experiment

  • Evaluated context encoders for multi-turn novel view synthesis: Video VAE outperformed CUT3R and LVSM in preserving appearance consistency across generations; adopted for Memory-V2V.
  • On 40 videos, Memory-V2V surpassed ReCamMaster (Ind/AR) and TrajectoryCrafter in cross-iteration consistency (MEt3R) and visual quality (VBench), maintaining camera accuracy and outperforming CUT3R/LVSM-based variants.
  • On 50 long videos (Señorita), Memory-V2V beat LucyEdit (Ind/FIFO) in visual quality and cross-frame consistency (DINO/CLIP), enabling coherent edits over 200+ frames.
  • Ablations confirmed dynamic tokenization + retrieval boosts long-term consistency (e.g., 1st–5th gen), while adaptive token merging cuts FLOPs/latency by 30% without quality loss; merging outperformed discarding in motion continuity.
  • Dynamic tokenization reduced FLOPs/latency by >90% vs. uniform tokenization; adaptive merging added further 30% savings, keeping inference time comparable to single-video synthesis.
  • Memory-V2V struggles with multi-shot videos due to scene transitions and accumulates artifacts from imperfect synthetic training extensions; future work includes multi-shot training and diffusion distillation integration.
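
To give a feel for why coarser tokenization kernels shrink the conditioning sequence, the sketch below counts tokens for the 1×2×2, 1×4×4, and 1×8×8 kernels named in the Method section. It only counts tokens and does not reproduce the reported FLOPs or latency figures. The latent size (21×64×104), the number of memory videos, and the function names are hypothetical choices made for illustration.

```python
def token_count(frames, height, width, kernel):
    """Tokens produced by patchifying a latent video with a (kt, kh, kw) kernel."""
    kt, kh, kw = kernel
    return (frames // kt) * (height // kh) * (width // kw)


def dynamic_budget(num_memory, latent=(21, 64, 104), top_k=3):
    """User video at 1x2x2, top-k memory videos at 1x4x4, the rest at 1x8x8."""
    f, h, w = latent
    n_top = min(top_k, num_memory)
    user = token_count(f, h, w, (1, 2, 2))
    top = n_top * token_count(f, h, w, (1, 4, 4))
    rest = (num_memory - n_top) * token_count(f, h, w, (1, 8, 8))
    return user + top + rest


def uniform_budget(num_memory, latent=(21, 64, 104)):
    """Every video, user input and memory alike, tokenized at 1x2x2."""
    f, h, w = latent
    return (1 + num_memory) * token_count(f, h, w, (1, 2, 2))


for n in (3, 8, 16):
    dyn, uni = dynamic_budget(n), uniform_budget(n)
    print(n, dyn, uni, round(dyn / uni, 3))
```

Each doubling of the spatial kernel size cuts the per-video token count by a factor of four, so the gap between the two budgets widens as more memory videos are added.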

Results show that cross-block consistency improves at deeper DiT blocks, as indicated by higher Pearson correlation, Spearman correlation, and Bottom-k overlap values going from Block 1 vs 2–30 to Block 21 vs 22–30. This suggests that the per-block estimates stabilize in later blocks, in line with applying token merging at Block 10 and Block 20.

Results show that combining dynamic tokenization with video retrieval significantly improves subject consistency, while adding adaptive token merging further enhances this metric without degrading aesthetic or imaging quality. The full model, which includes all components, achieves the highest consistency and motion smoothness, demonstrating the effectiveness of the proposed memory management strategy.

Results show that adaptive token merging significantly reduces computational cost, with the model achieving over 30% lower FLOPs and latency compared to the version without merging. The authors use this technique to maintain efficiency even when conditioning on a large number of memory videos, keeping inference time comparable to single-video synthesis.

Results show that Memory-V2V consistently outperforms all baselines across multi-view consistency, camera accuracy, and visual quality metrics, with the best performance in cross-iteration consistency and visual fidelity. The model achieves superior results compared to ReCamMaster and TrajCrafter, particularly in maintaining consistency across multiple generations while preserving motion and appearance quality.

Results show that Memory-V2V outperforms both LucyEdit variants across all metrics, achieving the highest scores in background consistency, aesthetic quality, imaging quality, temporal flickering, and motion smoothness. It also demonstrates superior cross-frame consistency with higher DINO and CLIP similarity values, indicating more coherent and visually consistent long video editing.

