OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory

Abstract

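Storytelling in real-world videos usually unfolds across multiple shots: clips that are visually discontinuous yet semantically connected, and that together convey a coherent narrative. Existing multi-shot video generation (MSV) methods, however, rely on limited temporal windows or single-keyframe conditioning, so they fail to model long-range cross-shot context effectively and their performance degrades on complex narratives. This work proposes OneStory, which enables global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, which allows autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. The work introduces two key modules: a Frame Selection module, which builds a semantically relevant global memory from informative frames of prior shots, and an Adaptive Conditioner, which performs importance-guided patchification to produce a compact context usable for direct conditioning. To reflect real-world storytelling patterns, the authors also curate a high-quality multi-shot dataset with referential captions and design effective training strategies under the next-shot framework. Fine-tuned from a pretrained I2V model on the curated 60K dataset, OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling.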

Summarization

Meta AI and University of Copenhagen researchers propose OneStory, a novel framework for multi-shot video generation that models global cross-shot context via a Frame Selection module and Adaptive Conditioner, enabling scalable, coherent long-form storytelling by reformulating the task as autoregressive next-shot prediction using pretrained image-to-video models.

Key Contributions

  • The paper introduces a novel framework for multi-shot video generation that addresses the limitations of existing methods by enabling consistent and scalable narrative modeling through adaptive keyframe selection and cross-shot context propagation.
  • It proposes an effective training strategy using a high-quality dataset of multi-shot videos with referential captions, allowing the model to maintain narrative coherence across discontinuous scenes.
  • By reformulating multi-shot generation as a next-shot prediction task, the approach enhances long-range temporal consistency and supports dynamic evolution of story elements, overcoming the issue of memory loss in prior single-keyframe methods.

Introduction

The authors leverage recent progress in diffusion transformers for video generation to address the challenge of multi-shot video synthesis, where models typically generate only single continuous scenes, limiting their use in real-world storytelling applications. Existing methods either rely on fixed-window attention mechanisms or keyframe conditioning, both of which suffer from memory loss and narrative inconsistency due to the limited context window as shots progress.

  • The proposed OneStory framework introduces three key innovations: (1) a Frame Selection module that identifies semantically relevant frames across all prior shots, (2) an Adaptive Conditioner that dynamically patchifies and injects contextual information into the generator based on frame importance, and (3) a training strategy using unified three-shot sequences and progressive coupling to enhance narrative coherence and scalability.

Dataset

  • The authors use a high-quality multi-shot video dataset consisting of approximately 60,000 videos, including 50,000 two-shot and 10,000 three-shot sequences, all centered on human-centric activities and sourced from research-copyrighted videos.
  • Shot boundaries are detected using TransNetV2, and only videos with at least two shots are retained. Each shot is captioned in two stages: first independently, then rewritten with reference to the prior shot’s content and caption, introducing referential expressions (e.g., “the same man”) to ensure narrative continuity.
  • Captioning is performed using a vision-language model to generate coherent, context-aware descriptions.
  • The dataset undergoes multi-stage filtering: keyword filters remove inappropriate content; CLIP and SigLIP filter out videos with irrelevant shot transitions; DINOv2 eliminates videos with overly similar shots, ensuring narrative progression and visual diversity.
  • To enable stable training, the authors unify all samples into a three-shot format. For two-shot videos, a synthetic middle or first shot is generated either by inserting a random shot from another video or by augmenting the first shot with spatial or color transformations.
  • The final shot in each sequence—real or synthetic—is used as the prediction target. The model is trained to generate this last shot conditioned on the first two shots and the corresponding caption, using a rectified-flow diffusion loss.
  • For evaluation, the authors curate a benchmark of 64 six-shot test cases each for text-to-multi-shot video (T2MSV) and image-to-multi-shot video (I2MSV) generation, covering three storytelling patterns: main-subject consistency, insert-and-recall with an intervening shot, and composable generation, enabling comprehensive assessment of narrative coherence and cross-shot reasoning.
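The three-shot unification described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `inflate_to_three_shots`, the `augment` callback, the equal-probability choice between the two strategies, and the equal-probability placement of the synthetic shot (first or middle slot) are all hypothetical.

```python
import random

def inflate_to_three_shots(shots, other_shots, augment, rng):
    """Return (three_shots, is_synthetic), keeping the last real shot as the target."""
    if len(shots) == 3:
        return list(shots), [False, False, False]
    if len(shots) != 2:
        raise ValueError("expected a two-shot or three-shot video")
    first, last = shots
    # Assumption: pick one of the two strategies with equal probability.
    if rng.random() < 0.5:
        synthetic = rng.choice(other_shots)  # random shot taken from another video
    else:
        synthetic = augment(first)           # spatially/color-augmented copy of the first shot
    # Assumption: the synthetic shot fills the first or the middle slot with equal probability.
    if rng.random() < 0.5:
        return [synthetic, first, last], [True, False, False]
    return [first, synthetic, last], [False, True, False]
```

The returned flags mark which slot is synthetic, which is the kind of information a training loop would need in order to apply heuristic rather than embedding-based relevance labels to that slot.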

Method

The authors leverage a unified autoregressive framework for multi-shot video generation, reformulating the task as a next-shot prediction problem conditioned on prior visual context and a narrative caption. The model, named OneStory, is initialized from a pretrained image-to-video diffusion model and fine-tuned on a curated 60K dataset. The core architecture, as shown in the figure below, comprises three key components: a Frame Selection module, an Adaptive Conditioner, and a DiT-based diffusion backbone, which operate in tandem during both training and inference.
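The next-shot formulation amounts to a simple autoregressive loop over captions. The sketch below only illustrates that control flow: `generate_story`, `dummy_next_shot`, and the `Shot` type are hypothetical names, and the placeholder generator stands in for the fine-tuned diffusion model.

```python
from typing import Callable, List, Sequence

Shot = List[str]  # stand-in for a video latent; here just a list of frame labels

def generate_story(
    captions: Sequence[str],
    next_shot_fn: Callable[[str, List[Shot]], Shot],
) -> List[Shot]:
    """Generate shots one by one, each conditioned on its caption and all earlier shots."""
    history: List[Shot] = []
    for caption in captions:
        shot = next_shot_fn(caption, list(history))  # pass a copy of the history so far
        history.append(shot)
    return history

def dummy_next_shot(caption: str, history: List[Shot]) -> Shot:
    """Placeholder generator: records the caption and how many prior shots it saw."""
    return [f"{caption}|ctx={len(history)}"]
```

For example, `generate_story(["A man enters a cafe", "The same man sits down"], dummy_next_shot)` returns `[["A man enters a cafe|ctx=0"], ["The same man sits down|ctx=1"]]`, showing that the second shot sees one prior shot as context.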

During generation, for the $i$-th shot, the model takes as input the caption $C_i$, the latent representations of all preceding shots $\{S_j\}_{j=1}^{i-1}$, and a noise tensor. The 3D VAE encoder first compresses each prior shot into a latent sequence, which is concatenated into a global historical memory $\mathbf{M} \in \mathbb{R}^{F \times N_s \times D_v}$, where $F$ is the total number of frames across prior shots and $N_s$ is the spatial token count per frame. The Frame Selection module then identifies the most semantically relevant frames from this memory. It employs $m$ learnable query tokens that first attend to the projected text features of $C_i$ to capture the current shot’s intent, and then attend to the projected visual memory $\mathbf{M}_1$ to extract visual cues. Frame-wise relevance scores $\mathbf{S} \in \mathbb{R}^{F}$ are computed via a projection and mean aggregation over query-memory interactions. The top-$K_\mathrm{sel}$ frames are selected based on these scores to form a compact context memory $\widehat{\mathbf{M}}$.
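A rough NumPy sketch of this selection step is given below. It is an illustration under several assumptions rather than the paper's implementation: attention is single-head with no learned projections, a plain dot product stands in for the score projection, scores are averaged over both queries and spatial tokens, and the selected indices are returned in temporal order. The names `attend` and `select_frames` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values):
    """Single-head scaled dot-product attention with keys equal to values."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d), axis=-1)
    return weights @ keys_values

def select_frames(queries, text_feats, memory, k_sel):
    """queries: (m, D); text_feats: (T, D); memory: (F, N_s, D), all already projected to D.

    Returns the indices of the top-k_sel frames and per-frame scores of shape (F,).
    """
    F, N_s, D = memory.shape
    tokens = memory.reshape(F * N_s, D)
    q = attend(queries, text_feats)  # queries first absorb the caption intent
    q = attend(q, tokens)            # then gather cues from the visual memory
    # Assumption: a dot product replaces the learned projection described in the text.
    interactions = (q @ tokens.T).reshape(-1, F, N_s)  # (m, F, N_s)
    scores = interactions.mean(axis=(0, 2))            # mean over queries and spatial tokens
    top = np.argsort(-scores)[:k_sel]
    return np.sort(top), scores                        # keep selected frames in temporal order
```

Because the memory spans all prior shots, the selected frames can come from any earlier shot, not only the most recent one.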

The Adaptive Conditioner processes this selected memory to generate a set of context tokens $\mathbf{C}$ that are efficiently injected into the diffusion process. It employs a set of patchifiers $\{\mathcal{P}_\ell\}_{\ell=1}^{L_p}$ with varying kernel sizes. Based on the relevance scores $\mathbf{S}$, frames in $\widehat{\mathbf{M}}$ are adaptively assigned to patchifiers: highly relevant frames are processed with finer, less compressive patchifiers, while less relevant ones use coarser ones. This content-driven allocation, illustrated in the figure below, contrasts with fixed temporal partitioning and ensures that critical visual information is preserved with minimal computational overhead.
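The importance-based allocation can be illustrated with a small sketch. The grouping rule used here (rank the frames by score and split the ranking into near-equal groups, one per patchifier) and the token-count formula for a square `k x k` kernel are assumptions made for illustration; the text does not specify the exact assignment rule, and the function names are hypothetical.

```python
def assign_patchifiers(scores, kernel_sizes):
    """Map each selected frame to a kernel size: higher relevance gets a smaller kernel.

    scores: dict {frame_index: relevance}; kernel_sizes: iterable of kernel sizes.
    """
    kernels = sorted(kernel_sizes)                         # finest patchifier first
    ranked = sorted(scores, key=scores.get, reverse=True)  # most relevant frame first
    n, levels = len(ranked), len(kernels)
    assignment = {}
    for rank, frame in enumerate(ranked):
        # Assumption: near-equal groups along the ranking, one group per patchifier.
        level = min(rank * levels // n, levels - 1)
        assignment[frame] = kernels[level]
    return assignment

def context_token_count(assignment, height, width):
    """Total context tokens if a k x k patchifier maps an H x W latent frame to (H//k)*(W//k) tokens."""
    return sum((height // k) * (width // k) for k in assignment.values())
```

Under these assumptions, the coarser kernels shrink the token budget of low-relevance frames quadratically, which is why the context can stay compact while the most relevant frames keep fine detail.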

The resulting context tokens $\mathbf{C}$ are concatenated with the noisy latent tokens $\mathbf{N}$ of the current shot to form the input $\mathbf{X}$ for the DiT backbone. This concatenation enables joint attention between noisy and context tokens, facilitating rich cross-attention interactions that guide the denoising process. The model is trained with a joint objective that combines a standard shot generation loss $\mathcal{L}_\mathrm{shot}$ with a supervision loss $\mathcal{L}_\mathrm{sel}$ for the frame relevance scores. The latter is computed using pseudo-labels derived from DINOv2 and CLIP embeddings for real frames, and heuristic labels for synthetic frames, ensuring the Frame Selection module learns to prioritize semantically aligned context.
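A minimal NumPy sketch of such a joint objective follows. Several details are assumptions not stated in the text: the rectified-flow path uses one common convention (linear interpolation between data and noise, with velocity target `noise - x0`), the selection loss is taken to be binary cross-entropy against 0/1 pseudo-labels, and the weighting factor `lam` is hypothetical.

```python
import numpy as np

def rectified_flow_loss(velocity_fn, x0, rng):
    """MSE between predicted and target velocity on a straight data-to-noise path."""
    noise = rng.standard_normal(x0.shape)
    t = rng.uniform(0.0, 1.0)
    x_t = (1.0 - t) * x0 + t * noise  # linear interpolation (one common convention)
    target = noise - x0               # constant velocity along the straight path
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target) ** 2))

def selection_loss(scores, pseudo_labels):
    """Assumed form: binary cross-entropy between sigmoid(scores) and 0/1 pseudo-labels."""
    p = 1.0 / (1.0 + np.exp(-scores))
    p = np.clip(p, 1e-7, 1.0 - 1e-7)  # avoid log(0)
    return float(-np.mean(pseudo_labels * np.log(p) + (1 - pseudo_labels) * np.log(1 - p)))

def joint_loss(shot_loss, sel_loss, lam=1.0):
    """Weighted sum of the two terms; the weight lam is a hypothetical hyperparameter."""
    return shot_loss + lam * sel_loss
```

In a real training step, `velocity_fn` would be the DiT backbone applied to the concatenated context and noisy tokens, with the loss evaluated only on the noisy-token positions.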

Experiment

  • Shot Inflation and Decoupled Conditioning improve narrative consistency, while both baselines fail, highlighting the effectiveness of the adaptive memory in maintaining stable long-range identity cues.
  • Shot Inflation and Decoupled Conditioning improve narrative consistency, while both baselines fail, confirming their complementary roles in cross-shot context modeling.

The authors use OneStory to generate multi-shot videos under both text- and image-conditioned settings, evaluating performance across inter-shot coherence, semantic alignment, intra-shot coherence, aesthetic quality, and dynamic degree. Results show OneStory consistently outperforms all baselines in both settings, achieving the highest scores across nearly all metrics, particularly in character and environment consistency, semantic alignment, and subject-background fidelity. This demonstrates its superior ability to maintain narrative continuity and visual quality across multiple shots.

The authors evaluate the impact of their Adaptive Conditioner (AC) and Frame Selection (FS) modules through ablation, showing that combining both yields the highest character consistency, environment consistency, and semantic alignment. Results confirm that each component contributes independently, with their joint use delivering the strongest narrative coherence.

The authors evaluate the impact of their training strategies, showing that Shot Inflation alone improves environment consistency and semantic alignment, while combining it with Decoupled Conditioning yields the highest scores across all metrics, confirming the effectiveness of their two-stage curriculum for stable optimization and narrative coherence.

The authors evaluate the impact of context token length on cross-shot consistency, finding that increasing from one to three latent-frame equivalents improves both character and environment consistency. Results show a steady performance gain with more context tokens, confirming the efficiency of their adaptive memory in modeling temporal dynamics.

