Zero-Shot Reference-to-Video Generation via Scaling

Abstract

Reference-to-video (R2V) generation aims to synthesize videos that match a text prompt while preserving the identity of the subjects provided by reference images. However, current R2V methods rely on explicit reference image-video-text triplet data, whose construction is extremely costly and difficult to scale. To sidestep this bottleneck, this work proposes Saber, a scalable zero-shot framework that requires no explicit R2V data at all. Trained only on video-text pairs, Saber learns representations that are both identity-consistent and sensitive to reference information through a masked training strategy and a dedicated attention-based model design. It further integrates a mask augmentation technique to mitigate copy-paste artifacts from reference images. Saber also generalizes well across varying numbers of reference images and achieves superior performance on the OpenS2V-Eval benchmark compared with methods trained on explicit R2V data.

Summarization

Researchers from Meta AI and King's College London propose Saber, a zero-shot reference-to-video generation framework that eliminates the need for costly reference image-video-text triplets. Trained solely on video-text pairs, Saber uses masked training, attention-based modeling, and mask augmentation to achieve strong identity preservation and outperform existing methods on OpenS2V-Eval.

Key Contributions

  • Saber introduces a zero-shot framework for reference-to-video generation that eliminates the need for costly, manually constructed reference image-video-text triplet datasets by training exclusively on video-text pairs.
  • The method proposes a masked training strategy with attention-based modeling and mask augmentations, enabling the model to learn identity-consistent, reference-aware representations while reducing copy-paste artifacts.
  • Saber achieves state-of-the-art performance on the OpenS2V-Eval benchmark, outperforming models trained on explicit R2V data, and demonstrates strong generalization across varying numbers of reference images and diverse subject categories.

Introduction

Reference-to-video (R2V) generation aims to create videos that align with a text prompt while preserving the identity and appearance of subjects from reference images, enabling personalized applications such as custom storytelling and virtual avatars. Prior methods rely on costly, manually constructed datasets of image-video-text triplets, which limit scalability and diversity and lead to poor generalization on unseen subjects. The authors instead propose a zero-shot framework called Saber that eliminates the need for such specialized data by training exclusively on large-scale video-text pairs, using a masked training strategy to simulate reference conditions.

Key innovations include:

  • A masked frame training approach that randomly samples and masks video frames as synthetic reference images, enabling data-efficient, scalable learning without curated R2V datasets
  • A guided attention mechanism with attention masks to enhance focus on reference-aware features and suppress background distractions
  • Spatial mask augmentations that improve visual fidelity and reduce copy-paste artifacts in generated videos while supporting flexible, multi-reference inputs

Dataset

  • The authors use video-text pairs constructed from the ShutterStock Video dataset, where captions are generated for all video clips using Qwen2.5-VL-Instruct.
  • The training data consists solely of these synthesized video-text pairs, enabling the use of both text-to-video and image-to-video sources through masked training.
  • No additional datasets are used; all training data is derived from ShutterStock Video with automatically generated captions.
  • The model is fine-tuned from Wan2.1-14B on this video-text data using a masked training strategy with probabilistic mask sampling.
  • For mask generation, the foreground area ratio r is sampled: 10% of the time from [0, 0.1] (minimal reference), 80% from [0.1, 0.5] (typical subjects), and 10% from [0.5, 1.0] (large or background references).
  • Mask augmentation includes random rotation (±10°), scaling (0.8–2.0), horizontal flipping (50% probability), and shearing (±10°) to reduce copy-paste artifacts; a short sketch of this ratio sampling and augmentation follows this list.
  • During inference, BiRefNet segments foreground subjects from reference images to generate masks.
  • The model is trained with AdamW, a learning rate of 1e−5, and a global batch size of 64, following the denoising objective in the paper.
  • Videos are generated with 50 denoising steps and a CFG guidance scale of 5.0, consistent with the Wan2.1 setup.
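Since the masked training hinges on how foreground ratios and augmentation parameters are drawn, the following minimal sketch reconstructs that sampling from the numbers reported in the list above. It is an illustration only; the function names and the use of Python's random module are assumptions, not the authors' implementation.

```python
import random

def sample_area_ratio(rng: random.Random) -> float:
    """Sample the foreground area ratio r from the reported mixture:
    10% of the time from [0, 0.1], 80% from [0.1, 0.5], 10% from [0.5, 1.0]."""
    u = rng.random()
    if u < 0.1:
        return rng.uniform(0.0, 0.1)
    if u < 0.9:
        return rng.uniform(0.1, 0.5)
    return rng.uniform(0.5, 1.0)

def sample_augmentation(rng: random.Random) -> dict:
    """Sample the augmentation parameters applied identically to a frame and its mask:
    rotation within +/-10 degrees, scaling in [0.8, 2.0], shearing within +/-10 degrees,
    and horizontal flipping with 50% probability."""
    return {
        "angle": rng.uniform(-10.0, 10.0),
        "scale": rng.uniform(0.8, 2.0),
        "shear": rng.uniform(-10.0, 10.0),
        "hflip": rng.random() < 0.5,
    }

rng = random.Random(0)
print(sample_area_ratio(rng))
print(sample_augmentation(rng))
```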

Method

The authors design a training paradigm that simulates reference-to-video generation without requiring explicit reference image-video-text triplets. Instead, they dynamically generate masked frames from video frames during training, which serve as synthetic reference inputs. This approach lets the model learn identity and appearance preservation from diverse, in-the-wild video content, circumventing the limitations of curated R2V datasets.

Refer to the framework diagram illustrating masked reference generation. For each randomly sampled frame $\mathbf{I}_k$ from a video, a mask generator produces a binary mask $\mathbf{M}_k$ with a controllable foreground area ratio $r \in [r_\text{min}, r_\text{max}]$. The mask generator selects from predefined shape categories, such as ellipses, Fourier blobs, or convex/concave polygons, and employs a bisection search over a continuous scale parameter to meet the target area ratio. Topology-preserving adjustments (growth or shrinkage) are applied if pixel discretization prevents exact matching. The resulting mask is then paired with the frame and subjected to identical spatial augmentations, including rotation, scaling, shear, translation, and optional horizontal flip, to disrupt spatial correspondence and mitigate copy-paste artifacts. The augmented mask $\bar{\mathbf{M}}_k$ is applied to the augmented frame $\bar{\mathbf{I}}_k$ to produce the masked frame $\hat{\mathbf{I}}_k = \bar{\mathbf{I}}_k \odot \bar{\mathbf{M}}_k$. This process is repeated to generate a set of $K$ masked frames $\{\hat{\mathbf{I}}_k\}_{k=1}^{K}$ that serve as the reference condition during training.
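As an illustration of the controllable-area mask generator, the sketch below restricts itself to the ellipse shape and uses a bisection search over the continuous scale parameter to reach a target foreground ratio. The Fourier-blob and polygon shapes and the topology-preserving adjustment are omitted, and all names are hypothetical rather than taken from the paper's code.

```python
import numpy as np

def ellipse_mask(h: int, w: int, scale: float) -> np.ndarray:
    """Binary mask of a centered ellipse whose semi-axes are `scale` times the image half-sizes."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ry, rx = max(scale * h / 2.0, 1e-6), max(scale * w / 2.0, 1e-6)
    return ((((ys - cy) / ry) ** 2 + ((xs - cx) / rx) ** 2) <= 1.0).astype(np.float32)

def mask_with_target_ratio(h: int, w: int, r_target: float, iters: int = 30) -> np.ndarray:
    """Bisection search over the scale parameter so that the foreground area ratio of the
    generated mask matches r_target up to pixel discretization."""
    lo, hi = 0.0, 2.0  # a scale of sqrt(2) already covers the whole frame
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if ellipse_mask(h, w, mid).mean() < r_target:
            lo = mid
        else:
            hi = mid
    return ellipse_mask(h, w, (lo + hi) / 2.0)

m = mask_with_target_ratio(256, 256, r_target=0.3)
print(m.mean())  # approximately 0.3
```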

The model architecture builds upon the Wan2.1-14B framework, which includes a VAE, a transformer backbone, and a text encoder. The VAE encodes both the target video and the masked reference frames into latent space, producing $\mathbf{z}_0$ and $\mathbf{z}_\text{ref}$, respectively. The input to the transformer is constructed by concatenating the noisy video latent $\mathbf{z}_t$, the reference latents $\mathbf{z}_\text{ref}$, and the corresponding mask latents $\mathbf{m}_\text{ref}$ along the temporal dimension. The mask latents are derived by resizing the binary masks $\mathbf{M}_k$ to the latent resolution and are concatenated with zero masks $\mathbf{m}_\text{zero}$ to match the temporal length of the video part. The full input $\mathbf{z}_\text{in}$ is formed by channel-wise concatenation of three components: the concatenated latents, the concatenated masks, and the zero latents concatenated with the reference latents (to preserve conditioning fidelity). This input format is defined as:

$$
\mathbf{z}_\text{in} = \mathsf{cat}\begin{bmatrix}
\mathsf{cat}[\,\mathbf{z}_{t} & \mathbf{z}_\text{ref}\,]_\text{temporal} \\
\mathsf{cat}[\,\mathbf{m}_\text{zero} & \mathbf{m}_\text{ref}\,]_\text{temporal} \\
\mathsf{cat}[\,\mathbf{z}_\text{zero} & \mathbf{z}_\text{ref}\,]_\text{temporal}
\end{bmatrix}_\text{channel}
$$
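To make this construction concrete, the snippet below assembles the input tensor from dummy latents with illustrative shapes; the actual channel counts, patchification, and batching follow Wan2.1 and are not reproduced here, so treat the dimensions as assumptions.

```python
import torch

C, T, H, W = 16, 8, 32, 32   # illustrative latent dims: channels, video frames, height, width
K = 2                        # number of masked reference frames

z_t    = torch.randn(C, T, H, W)    # noisy video latent
z_ref  = torch.randn(C, K, H, W)    # latents of the masked reference frames
m_ref  = torch.ones(1, K, H, W)     # reference masks resized to latent resolution
m_zero = torch.zeros(1, T, H, W)    # zero masks padding the video part
z_zero = torch.zeros(C, T, H, W)    # zero latents paired with the video part

# Temporal concatenation within each row of the equation, then channel-wise concatenation of the rows.
row_latents = torch.cat([z_t, z_ref], dim=1)
row_masks   = torch.cat([m_zero, m_ref], dim=1)
row_cond    = torch.cat([z_zero, z_ref], dim=1)
z_in = torch.cat([row_latents, row_masks, row_cond], dim=0)
print(z_in.shape)  # (C + 1 + C, T + K, H, W)
```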

As shown in the framework figure, the transformer processes this input through a sequence of blocks, each comprising self-attention, cross-attention, and feed-forward modules. In self-attention, video and reference tokens interact under an attention mask derived from the resized masks $\mathbf{M}_k$, ensuring that only valid reference regions are attended to. Cross-attention then integrates the text features $\mathbf{z}_\text{P}$, guiding video tokens with the text prompt while aligning reference tokens semantically. The time step $t$ is injected into the latents to modulate the diffusion process. The transformer predicts the denoised latent $\mathbf{z}_{t-1}$, and the final denoised latent is decoded by the VAE into the output video.
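The sketch below shows one way such an attention mask could be built: video tokens attend to all video tokens but only to reference tokens whose resized mask value is foreground. The token layout and the boolean convention (True means the position participates in attention, as accepted by torch.nn.functional.scaled_dot_product_attention) are assumptions for illustration.

```python
import torch

def build_attention_mask(num_video_tokens: int, ref_token_valid: torch.Tensor) -> torch.Tensor:
    """ref_token_valid: bool tensor of shape (num_ref_tokens,), True where the resized
    reference mask marks foreground. Returns a bool mask of shape (N, N) with True = attend."""
    n_ref = ref_token_valid.numel()
    n = num_video_tokens + n_ref
    attend = torch.ones(n, n, dtype=torch.bool)
    # No token attends to masked-out (background) reference tokens.
    attend[:, num_video_tokens:] = ref_token_valid.unsqueeze(0)
    return attend

valid = torch.tensor([True, True, False, True])  # 4 reference tokens, one of them background
mask = build_attention_mask(num_video_tokens=6, ref_token_valid=valid)
# `mask` can be passed as attn_mask to torch.nn.functional.scaled_dot_product_attention.
print(mask.shape, mask[:, 6:].int())
```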

Experiment

  • Evaluated on OpenS2V-Eval benchmark, Saber achieves 57.91% total score in a zero-shot setting, surpassing closed-source methods like Kling1.6 (+1.68%) and explicitly trained R2V models including Phantom-14B (+1.14%), VACE-14B (+0.36%), and BindWeave (+0.30%).
  • Saber attains the highest NexusScore (47.22%), outperforming Phantom by 9.79%, VACE by 3.14%, and BindWeave by 0.36%, demonstrating superior subject consistency without using dedicated R2V training data.
  • Ablation studies show masked training improves total score by +1.67%, while diverse mask generation and augmentation are critical for generalization and natural composition, reducing copy-paste artifacts.
  • The attention mask effectively prevents visual artifacts and enhances subject separation and blending by constraining reference-video token interactions.
  • Qualitative results confirm Saber accurately generates videos with consistent identities across single and multiple human/object references, outperforming Kling1.6, Phantom, and VACE in identity preservation and coherence.
  • Emergent abilities include handling multiple views of a single subject and robust cross-modal alignment between reference images and text prompts, enabling accurate attribute and positional control.

Trained only on video-text pairs in a zero-shot setting, Saber achieves the highest total score and NexusScore on the OpenS2V-Eval benchmark, outperforming both closed-source and explicitly trained R2V methods. The results show that Saber excels in subject consistency while remaining competitive on text-video alignment and naturalness, validating its effectiveness without dedicated R2V datasets.

The authors evaluate Saber’s ablation components on the OpenS2V-Eval benchmark, showing that removing masked training reduces the total score by 1.67% and significantly lowers NexusScore, indicating weaker subject consistency. Using only one mask type or fixing the foreground ratio degrades performance, confirming that mask diversity and adaptive masking are critical for generalization and video quality.
