
OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory

Abstract

Storytelling in real-world videos often unfolds across multiple shots: discontinuous yet semantically related clips that together convey a coherent narrative. However, current multi-shot video (MSV) generation methods struggle to model long-range cross-shot context effectively, relying on limited temporal windows or a single keyframe as conditioning, which degrades performance on complex narratives. In this work, we propose OneStory, an approach that enables global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates the MSV task as autoregressive next-shot generation while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module, which builds a semantically relevant global memory from the most informative frames of previous shots, and an Adaptive Conditioner, which performs importance-guided patchification to produce a compact context that is used directly as conditioning. We also curate a high-quality multi-shot dataset enriched with referential captions to faithfully reflect real-world narrative patterns, and design effective training strategies under the next-shot generation paradigm. Fine-tuned from a pretrained I2V model on our 60K-sample dataset, OneStory achieves state-of-the-art narrative consistency across diverse and complex scenes in both text- and image-conditioned settings, enabling long, controllable, and immersive video storytelling.

Summary

Researchers from Meta AI and the University of Copenhagen propose OneStory, a framework for multi-shot video generation that models global cross-shot context through a Frame Selection module and an Adaptive Conditioner. By reformulating the task as autoregressive next-shot prediction on top of pretrained image-to-video models, OneStory enables scalable, coherent long-form storytelling.

Key Contributions

  • The paper introduces a novel framework for multi-shot video generation that addresses the limitations of existing methods by enabling consistent and scalable narrative modeling through adaptive keyframe selection and cross-shot context propagation.
  • It proposes an effective training strategy using a high-quality dataset of multi-shot videos with referential captions, allowing the model to maintain narrative coherence across discontinuous scenes.
  • By reformulating multi-shot generation as a next-shot prediction task, the approach enhances long-range temporal consistency and supports dynamic evolution of story elements, overcoming the issue of memory loss in prior single-keyframe methods.

Introduction

The authors leverage recent progress in diffusion transformers for video generation to address the challenge of multi-shot video synthesis, where models typically generate only single continuous scenes, limiting their use in real-world storytelling applications. Existing methods either rely on fixed-window attention mechanisms or keyframe conditioning, both of which suffer from memory loss and narrative inconsistency due to the limited context window as shots progress.

  • The proposed OneStory framework introduces three key innovations: (1) a Frame Selection module that identifies semantically relevant frames across all prior shots, (2) an Adaptive Conditioner that dynamically patchifies and injects contextual information into the generator based on frame importance, and (3) a training strategy using unified three-shot sequences and progressive coupling to enhance narrative coherence and scalability.

Dataset

  • The authors use a high-quality multi-shot video dataset consisting of approximately 60,000 videos, including 50,000 two-shot and 10,000 three-shot sequences, all centered on human-centric activities and sourced from research-copyrighted videos.
  • Shot boundaries are detected using TransNetV2, and only videos with at least two shots are retained. Each shot is captioned in two stages: first independently, then rewritten with reference to the prior shot’s content and caption, introducing referential expressions (e.g., “the same man”) to ensure narrative continuity.
  • Captioning is performed using a vision-language model to generate coherent, context-aware descriptions.
  • The dataset undergoes multi-stage filtering: keyword filters remove inappropriate content; CLIP and SigLIP filter out videos with irrelevant shot transitions; DINOv2 eliminates videos with overly similar shots, ensuring narrative progression and visual diversity.
  • To enable stable training, the authors unify all samples into a three-shot format. For two-shot videos, a synthetic middle or first shot is generated either by inserting a random shot from another video or by augmenting the first shot with spatial or color transformations (a sketch of this unification step follows the list below).
  • The final shot in each sequence—real or synthetic—is used as the prediction target. The model is trained to generate this last shot conditioned on the first two shots and the corresponding caption, using a rectified-flow diffusion loss.
  • For evaluation, the authors curate a benchmark of 64 six-shot test cases each for text-to-multi-shot video (T2MSV) and image-to-multi-shot video (I2MSV) generation, covering three storytelling patterns: main-subject consistency, insert-and-recall with an intervening shot, and composable generation, enabling comprehensive assessment of narrative coherence and cross-shot reasoning.
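
To make the three-shot unification concrete, here is a minimal sketch of how two-shot samples might be padded into the unified format, assuming shots are decoded video tensors with values in [0, 1]; the helper names and augmentation choices are illustrative assumptions, not the authors' released code.

```python
import random
import torch

def spatial_color_augment(shot: torch.Tensor) -> torch.Tensor:
    """Cheap synthetic variant of a shot: horizontal flip plus mild brightness jitter."""
    flipped = torch.flip(shot, dims=[-1])                # flip along the width axis
    jitter = 1.0 + 0.1 * (torch.rand(1).item() - 0.5)    # roughly +/-5% brightness change
    return (flipped * jitter).clamp(0.0, 1.0)

def unify_to_three_shots(shots, random_other_shot=None):
    """Return (context_shots, target_shot) in the unified three-shot format.

    Three-shot videos are used as-is; two-shot videos receive a synthetic first or
    middle shot, either borrowed from another video or derived from the first shot.
    """
    if len(shots) >= 3:
        return shots[:2], shots[2]
    assert len(shots) == 2, "only videos with at least two shots are kept"
    if random_other_shot is not None and random.random() < 0.5:
        synthetic = random_other_shot                # insert a random shot from another video
    else:
        synthetic = spatial_color_augment(shots[0])  # or augment the first shot
    # Place the synthetic shot as the first or the middle context shot; the real
    # final shot of the sequence remains the prediction target.
    context = [synthetic, shots[0]] if random.random() < 0.5 else [shots[0], synthetic]
    return context, shots[1]
```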

Method

The authors leverage a unified autoregressive framework for multi-shot video generation, reformulating the task as a next-shot prediction problem conditioned on prior visual context and a narrative caption. The model, named OneStory, is initialized from a pretrained image-to-video diffusion model and fine-tuned on a curated 60K dataset. The core architecture, as shown in the figure below, comprises three key components: a Frame Selection module, an Adaptive Conditioner, and a DiT-based diffusion backbone, which operate in tandem during both training and inference.
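
The next-shot prediction loop can be sketched as follows. The component interfaces (vae, frame_selector, adaptive_conditioner, dit) are hypothetical placeholders standing in for the modules described below, not OneStory's actual API.

```python
import torch

def generate_story(captions, vae, frame_selector, adaptive_conditioner, dit,
                   first_image=None):
    """Generate a multi-shot video one shot at a time, each shot conditioned on
    all previously generated shots and the current caption."""
    shots = []                                       # decoded shots generated so far
    for i, caption in enumerate(captions):
        if shots:
            # Global historical memory: per-frame latents of all prior shots,
            # concatenated along the frame axis into shape (F, N_s, D_v).
            memory = torch.cat([vae.encode(s) for s in shots], dim=0)
            selected, scores = frame_selector(memory, caption)       # top-K_sel frames
            context_tokens = adaptive_conditioner(selected, scores)  # compact context C
        else:
            # First shot: no history, so condition on text (and an image in I2MSV).
            context_tokens = None
        latent = dit.sample(caption, context_tokens,
                            image=first_image if i == 0 else None)
        shots.append(vae.decode(latent))
    return shots
```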

During generation, for the $i$-th shot, the model takes as input the caption $C_i$, the latent representations of all preceding shots $\{S_j\}_{j=1}^{i-1}$, and a noise tensor. The 3D VAE encoder first compresses each prior shot into a latent sequence; these sequences are concatenated into a global historical memory $\mathbf{M} \in \mathbb{R}^{F \times N_s \times D_v}$, where $F$ is the total number of frames across prior shots and $N_s$ is the spatial token count per frame. The Frame Selection module then identifies the most semantically relevant frames in this memory. It employs $m$ learnable query tokens that first attend to the projected text features of $C_i$ to capture the current shot's intent, and then attend to the projected visual memory $\mathbf{M}_1$ to extract visual cues. Frame-wise relevance scores $\mathbf{S} \in \mathbb{R}^{F}$ are computed via a projection and mean aggregation over the query-memory interactions. The top-$K_\mathrm{sel}$ frames are selected based on these scores to form a compact context memory $\widehat{\mathbf{M}}$.
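
A minimal PyTorch sketch of this selection step, assuming standard multi-head cross-attention, spatial mean-pooling of the memory, and attention weights as the query-memory interactions; the paper's exact projections and aggregation may differ.

```python
import torch
import torch.nn as nn

class FrameSelection(nn.Module):
    """Selects the K_sel most relevant memory frames for the next shot."""

    def __init__(self, d_model: int = 1024, num_queries: int = 8, k_sel: int = 16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.text_proj = nn.Linear(d_model, d_model)   # project caption features
        self.mem_proj = nn.Linear(d_model, d_model)    # project visual memory
        self.text_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.mem_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.k_sel = k_sel

    def forward(self, memory: torch.Tensor, text: torch.Tensor):
        """memory: (F, N_s, D) latents of all prior shots; text: (L, D) caption features."""
        num_frames = memory.shape[0]
        q = self.queries.unsqueeze(0)                             # (1, m, D)
        # 1) Queries attend to the projected caption to capture the next shot's intent.
        t = self.text_proj(text).unsqueeze(0)                     # (1, L, D)
        q, _ = self.text_attn(q, t, t)
        # 2) Queries attend to the projected, spatially pooled memory frames.
        frames = self.mem_proj(memory.mean(dim=1)).unsqueeze(0)   # (1, F, D)
        _, attn = self.mem_attn(q, frames, frames)                # attn: (1, m, F)
        # 3) Frame-wise relevance: average the query-memory attention over the m queries.
        scores = attn.squeeze(0).mean(dim=0)                      # (F,)
        top = torch.topk(scores, k=min(self.k_sel, num_frames)).indices
        return memory[top], scores[top]                           # compact memory + its scores
```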

The Adaptive Conditioner processes this selected memory to generate a set of context tokens $\mathbf{C}$ that are efficiently injected into the diffusion process. It employs a set of patchifiers $\{\mathcal{P}_\ell\}_{\ell=1}^{L_p}$ with varying kernel sizes. Based on the relevance scores $\mathbf{S}$, frames in $\widehat{\mathbf{M}}$ are adaptively assigned to patchifiers: highly relevant frames are processed with finer, less compressive patchifiers, while less relevant frames use coarser ones. This content-driven allocation, illustrated in the figure below, contrasts with fixed temporal partitioning and ensures that critical visual information is preserved with minimal computational overhead.
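
A hedged sketch of this importance-guided patchification: selected frames are bucketed by relevance score, and each bucket is compressed at a different rate. The patch sizes, bucketing rule, and pooling-plus-projection patchifiers here are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class AdaptiveConditioner(nn.Module):
    """Compresses selected memory frames into context tokens, spending more tokens
    on more relevant frames (finer patchification) and fewer on the rest."""

    def __init__(self, d_in: int = 1024, d_model: int = 1024, patch_sizes=(1, 2, 4)):
        super().__init__()
        self.patch_sizes = patch_sizes                 # spatial tokens merged per level
        self.projs = nn.ModuleList(nn.Linear(d_in, d_model) for _ in patch_sizes)

    def forward(self, selected: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
        """selected: (K, N_s, D_in) chosen frames; scores: (K,) their relevance."""
        K, N_s, _ = selected.shape
        # Rank frames by relevance and split them across compression levels: the most
        # relevant frames go to the finest patchifier, the least relevant to the coarsest.
        order = torch.argsort(scores, descending=True)
        buckets = torch.chunk(order, len(self.patch_sizes))
        tokens = []
        for level, idx in enumerate(buckets):
            if idx.numel() == 0:
                continue
            p = self.patch_sizes[level]
            frames = selected[idx]                                  # (k, N_s, D_in)
            # Merge p neighbouring spatial tokens per output token (assumes N_s % p == 0).
            pooled = frames.view(frames.shape[0], N_s // p, p, -1).mean(dim=2)
            tokens.append(self.projs[level](pooled).flatten(0, 1))  # (k * N_s / p, d_model)
        return torch.cat(tokens, dim=0)                             # context tokens C
```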

The resulting context tokens $\mathbf{C}$ are concatenated with the noisy latent tokens $\mathbf{N}$ of the current shot to form the input $\mathbf{X}$ for the DiT backbone. This concatenation enables joint attention between noisy and context tokens, facilitating rich cross-attention interactions that guide the denoising process. The model is trained with a joint objective that combines a standard shot generation loss $\mathcal{L}_\mathrm{shot}$ with a supervision loss $\mathcal{L}_\mathrm{sel}$ on the frame relevance scores. The latter is computed using pseudo-labels derived from DINOv2 and CLIP embeddings for real frames, and heuristic labels for synthetic frames, ensuring the Frame Selection module learns to prioritize semantically aligned context.
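
The joint objective can be sketched as below, assuming a rectified-flow velocity target for $\mathcal{L}_\mathrm{shot}$ (consistent with the rectified-flow diffusion loss mentioned in the Dataset section) and a simple regression of the predicted frame scores onto the pseudo-labels for $\mathcal{L}_\mathrm{sel}$; the weighting lambda_sel is an assumption.

```python
import torch.nn.functional as F

def joint_loss(pred_velocity, x0_latent, noise, pred_scores, pseudo_labels,
               lambda_sel: float = 0.1):
    """pred_velocity: DiT output for a noisy latent x_t = (1 - t) * x0 + t * noise."""
    # Rectified-flow shot loss: regress the straight-line velocity between noise and data.
    target_velocity = noise - x0_latent
    loss_shot = F.mse_loss(pred_velocity, target_velocity)
    # Selection loss: align predicted frame relevance scores with pseudo-labels derived
    # from DINOv2/CLIP similarity for real frames (heuristic labels for synthetic ones).
    loss_sel = F.mse_loss(pred_scores, pseudo_labels)
    return loss_shot + lambda_sel * loss_sel
```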

Experiment

  • Shot Inflation and Decoupled Conditioning each improve narrative consistency while both baselines fail, highlighting the effectiveness of the adaptive memory in maintaining stable long-range identity cues and confirming the two strategies' complementary roles in cross-shot context modeling.

The authors use OneStory to generate multi-shot videos under both text- and image-conditioned settings, evaluating performance across inter-shot coherence, semantic alignment, intra-shot coherence, aesthetic quality, and dynamic degree. Results show OneStory consistently outperforms all baselines in both settings, achieving the highest scores across nearly all metrics, particularly in character and environment consistency, semantic alignment, and subject-background fidelity. This demonstrates its superior ability to maintain narrative continuity and visual quality across multiple shots.

The authors evaluate the impact of their Adaptive Conditioner (AC) and Frame Selection (FS) modules through ablation, showing that combining both yields the highest character consistency, environment consistency, and semantic alignment. Results confirm that each component contributes independently, with their joint use delivering the strongest narrative coherence.

The authors evaluate the impact of their training strategies, showing that Shot Inflation alone improves environment consistency and semantic alignment, while combining it with Decoupled Conditioning yields the highest scores across all metrics, confirming the effectiveness of their two-stage curriculum for stable optimization and narrative coherence.

The authors evaluate the impact of context token length on cross-shot consistency, finding that increasing from one to three latent-frame equivalents improves both character and environment consistency. Results show a steady performance gain with more context tokens, confirming the efficiency of their adaptive memory in modeling temporal dynamics.
