HyperAIHyperAI

Command Palette

Search for a command to run...

DreaMontage: 임의 프레임 유도형 원샷 영상 생성

초록

"원샷(One-shot)" 기법은 영화 제작에서 독창적이고 정교한 미학을 대표한다. 그러나 현실적인 구현 과정에서는 막대한 비용과 복잡한 실용적 제약에 의해 흔히 제한된다. 최근 등장한 영상 생성 모델은 가상적 대안을 제공하지만, 기존의 접근 방식은 일반적으로 단순한 클립 연결(clip concatenation)에 의존하여 시각적 부드러움과 시간적 일관성을 자주 유지하지 못한다. 본 논문에서는 다양한 사용자 입력을 바탕으로 원활하고 표현력 있으며 장시간 지속 가능한 원샷 영상을 생성할 수 있는 종합적인 프레임워크인 DreaMontage를 제안한다. 이를 달성하기 위해 세 가지 주요 차원에서 도전 과제를 해결한다. (i) DiT 아키텍처에 경량화된 중간 조건부 조절 메커니즘을 통합함으로써, 기본 학습 데이터를 효과적으로 활용하는 적응형 튜닝(Adaptive Tuning) 전략을 적용하여, 임의의 프레임에 대한 강력한 제어 능력을 확보하였다. (ii) 시각적 정밀도와 영화적 표현력을 향상시키기 위해 고품질 데이터셋을 수집하고, 시각적 표현 강화를 위한 SFT(Supervised Fine-Tuning) 단계를 도입하였다. 특히 주체의 움직임 타당성과 전환의 부드러움과 같은 핵심 문제를 해결하기 위해 맞춤형 DPO(Direct Preference Optimization) 전략을 적용하여 생성 콘텐츠의 성공률과 사용성에 상당한 향상을 이끌어냈다. (iii) 장기간 시퀀스 생성을 용이하게 하기 위해 메모리 효율적인 방식으로 작동하는 세그먼트 단위 자동 회귀(Segment-wise Auto-Regressive, SAR) 추론 전략을 설계하였다. 광범위한 실험을 통해 제안한 방법이 시각적으로 인상적이고 완벽하게 일관된 원샷 효과를 달성하면서도 계산 효율성을 유지함을 입증하였으며, 사용자가 산만한 시각 자료를 생동감 있고 통합된 원샷 영화 경험으로 변환할 수 있도록 지원한다.

One-sentence Summary

ByteDance researchers propose DreaMontage, a framework for seamless one-shot video generation that overcomes prior clip concatenation limitations through adaptive tuning within DiT architecture and segment-wise autoregressive inference, enabling cinematic-quality long sequences from fragmented inputs via visual expression refinement and tailored optimization.

Key Contributions

  • The one-shot filmmaking technique faces prohibitive real-world costs and constraints, while existing video generation methods rely on naive clip concatenation that fails to ensure visual smoothness and temporal coherence across transitions. DreaMontage introduces a lightweight intermediate-conditioning mechanism integrated into the DiT architecture, using an Adaptive Tuning strategy to leverage base training data for robust arbitrary-frame control capabilities.
  • To address critical issues like subject motion rationality and transition smoothness in generated content, the framework curates a high-quality dataset and implements a Visual Expression Supervised Fine-Tuning stage followed by a Tailored DPO scheme. This pipeline significantly improves cinematic expressiveness and the success rate of seamless one-shot video synthesis.
  • For generating extended one-shot sequences under memory constraints, DreaMontage designs a Segment-wise Auto-Regressive inference strategy that enables long-duration production while maintaining computational efficiency. Extensive experiments confirm the approach achieves visually striking, temporally coherent results that transform fragmented inputs into cohesive cinematic experiences.

Introduction

The authors address the challenge of generating seamless "one-shot" long videos, which are highly valued in filmmaking for immersive storytelling but traditionally require costly production and physical constraints. While recent video diffusion models offer potential, prior approaches relying on first-last frame conditioning fail to ensure temporal coherence, often producing disjointed transitions due to limitations in latent space representation of intermediate frames, semantic shifts between keyframes, and prohibitive computational demands for extended durations. To overcome these, the authors introduce DreaMontage, which implements three key innovations: an intermediate-conditioning mechanism with Shared-RoPE and Adaptive Training for precise frame-level control, Supervised Fine-Tuning with Differentiable Prompt Optimization on curated datasets to enhance visual continuity and reduce abrupt cuts, and a Segment-wise Auto-Regressive inference strategy that enables memory-efficient long-video generation while maintaining narrative integrity.

Dataset

The authors describe their Visual Expression SFT dataset as follows:

  • Composition and sources: The dataset comprises newly collected, category-balanced video samples specifically targeting model weaknesses. It originates from a fine-grained analysis of underperforming cases, structured using a hierarchical taxonomy.
  • Key subset details: The data spans five major classes (Camera Shots, Visual Effects, Sport, Spatial Perception, Advanced Transitions), each divided into precise subclasses (e.g., "Basic Camera Movements – Dolly In" under Camera Shots; "Generation – Light" under Visual Effects). It is a small-scale collection where videos per subclass were carefully selected for core scenario characteristics and high motion dynamics. Videos are longer (up to 20 seconds) and feature more seamless scene transitions compared to prior adaptive tuning data.
  • Usage in training: The authors apply Supervised Fine-Tuning (SFT) using this dataset directly on the model weights obtained from the previous adaptive tuning stage. They reuse similar training strategies and random condition settings from that prior stage.
  • Processing details: The primary processing distinction is the intentional collection of longer-duration videos emphasizing motion dynamics and transitions. No specific cropping strategies or metadata construction beyond the hierarchical classification and selection criteria are mentioned.

Method

The authors leverage a DiT-based video generation framework, extending it with a novel intermediate-conditioning mechanism and a progressive training pipeline to enable arbitrary-frame guided synthesis of long, cinematic one-shot videos. The overall architecture, as shown in the figure below, is structured around two core components: a training pipeline that incrementally refines the model’s capabilities, and an inference pipeline that supports flexible, memory-efficient generation.

At the core of the framework is the Interm-Cond Adaptation strategy, which addresses the temporal misalignment inherent in conditioning on arbitrary frames. The VideoVAE encoder performs 2x temporal downsampling, meaning a single frame’s latent representation corresponds to multiple generated frames, leading to imprecise conditioning. As shown in the figure below, the authors resolve this by aligning the training distribution with inference: for single-frame conditions, the frame is re-encoded; for video conditions, subsequent frames are re-sampled from the latent distribution to match the temporal granularity of the target video. This lightweight tuning enables robust arbitrary-frame control without architectural overhaul.

For super-resolution, the authors introduce Shared-RoPE to mitigate flickering and color shifts caused by channel-wise concatenation of conditioning signals. As depicted in the figure below, in addition to channel-wise conditioning, the VAE latent of each reference image is concatenated along the token sequence dimension, with its Rotary Position Embedding (RoPE) set to match the corresponding temporal position. This sequence-wise conditioning ensures spatial-temporal alignment, particularly critical for maintaining fidelity at higher resolutions. For video conditions, Shared-RoPE is applied only to the first frame to avoid computational overhead.

To enhance visual expressiveness and temporal coherence, the authors implement a Visual Expression SFT stage using a manually curated high-quality dataset. This is followed by a Tailored DPO training phase, which targets two specific failure modes: abrupt cuts and physically implausible subject motion. As shown in the figure below, two distinct pipelines generate contrastive preference pairs. Pipeline A uses a trained VLM discriminator to automatically select “best” and “worst” videos from groups generated with the same prompt but different seeds, focusing on cut severity. Pipeline B relies on human annotation to identify problematic subject motions, generating pairs that guide the model toward physically plausible dynamics. The DPO objective directly optimizes the policy πθ\pi_{\theta}πθ against a reference model πref\pi_{\text{ref}}πref:

LDPO=E(c,vw,vl)D[logσ(βlogπθ(vwc)πref(vwc)βlogπθ(vlc)πref(vlc))]\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(c, v_{w}, v_{l}) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_{\theta}(v_{w}|c)}{\pi_{\mathrm{ref}}(v_{w}|c)} - \beta \log \frac{\pi_{\theta}(v_{l}|c)}{\pi_{\mathrm{ref}}(v_{l}|c)} \right) \right]LDPO=E(c,vw,vl)D[logσ(βlogπref(vwc)πθ(vwc)βlogπref(vlc)πθ(vlc))]

where ccc denotes the conditioning inputs, and β\betaβ controls the deviation from the reference policy.

For long-form generation, the authors design a Segment-wise Auto-Regressive (SAR) inference strategy. The target video is partitioned into consecutive segments using a sliding window in the latent space, with user-provided conditions acting as candidate boundaries. Each segment sn\mathfrak{s}_nsn is generated conditionally on the tail latents of the previous segment τ(sn1)\tau(\mathfrak{s}_{n-1})τ(sn1) and the local conditions Cn\mathcal{C}_nCn:

sn=Gθ(τ(sn1),Cn)\mathfrak{s}_{n} = \mathcal{G}_{\theta} \left( \tau(\mathfrak{s}_{n-1}), \mathcal{C}_{n} \right)sn=Gθ(τ(sn1),Cn)

where Cn={cn(1),,cn(m)}\mathcal{C}_n = \{c_n^{(1)}, \ldots, c_n^{(m)}\}Cn={cn(1),,cn(m)} represents the heterogeneous conditions within the current window. This auto-regressive mechanism ensures pixel-level continuity across segment boundaries. Overlapping latents are fused before decoding, yielding a temporally coherent long video. The entire process operates in the latent space, avoiding pixel-level artifacts and leveraging the model’s learned consistency from prior training stages.

Experiment

  • Demonstrated arbitrary frame-guided one-shot video generation through qualitative examples, showing coherent narrative transitions across complex scenarios like train-to-cyberpunk shifts and eye-to-meadow sequences without morphing artifacts.
  • In multi-keyframe conditioning, achieved 15.79% higher overall preference than Vidu Q2 and 28.95% over Pixverse V5, with significant gains in prompt following (+23.68%) while maintaining competitive motion and visual quality.
  • In first-last frame conditioning, surpassed Kling 2.5 by 3.97% in overall preference with consistent improvements in motion effects and prompt following (+4.64% each), matching visual fidelity of top-tier models.

The authors use ablation studies to isolate the impact of key optimizations in DreaMontage, showing that combining SFT with DPO improves motion handling and overall performance, while Shared-RoPE delivers the largest gain in visual quality. Results show that adaptive training alone boosts motion and prompt following without affecting visual fidelity, and Shared-RoPE significantly enhances visual quality over its base variant. The cumulative effect of these optimizations leads to substantial overall performance gains across multiple metrics.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp