SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Abstract

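Current instruction-guided video editing models struggle to balance precise semantic modification with faithful motion preservation. Existing approaches mitigate this by injecting explicit external priors (e.g., VLM features or structural conditions), but that reliance severely limits model robustness and generalization. To overcome this limitation, we propose SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), so that the model internalizes temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, factorized pre-training alone already yields strong zero-shot video editing ability, which validates the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.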

One-sentence Summary

Researchers from Baidu, Tsinghua University, and other institutions present SAMA, a framework that factorizes instruction-guided video editing into semantic anchoring and motion alignment; pre-training on motion-centric restoration tasks without paired editing data already yields strong zero-shot editing ability, and the fine-tuned model reaches state-of-the-art performance among open-source models while avoiding the robustness bottleneck of prior methods that depend on external priors.

Key Contributions

  • The paper introduces SAMA, a framework that factorizes video editing into semantic anchoring and motion modeling to reduce reliance on explicit external priors like VLM features or structural conditions.
  • Semantic Anchoring establishes reliable visual anchors by jointly predicting semantic tokens and video latents at sparse frames, while Motion Alignment pre-trains the backbone on motion-centric restoration tasks to internalize temporal dynamics from raw videos.
  • Experiments demonstrate that the proposed two-stage training pipeline yields strong zero-shot editing capabilities and achieves state-of-the-art performance among open-source models, competing with leading commercial systems.

Introduction

Instruction-guided video editing aims to apply fine-grained semantic changes while preserving the temporal coherence of motion, yet current models struggle to balance these competing demands. Prior approaches often rely on injecting explicit external priors like skeletons or depth maps, which constrains the diffusion backbone from learning inherent semantic-motion representations and leads to artifacts or diluted edits. The authors propose SAMA, a framework that factorizes semantic planning from motion modeling by introducing Semantic Anchoring for instruction-aware structural planning and Motion Alignment to internalize temporal dynamics through motion-centric pre-training. This two-stage strategy enables the model to achieve state-of-the-art performance among open-source systems without heavy reliance on brittle external signals.

Dataset

  • Dataset Composition and Sources: The authors curate a mixed dataset for image and video editing, drawing from NHR-Edit, GPT-image-edit, X2Edit, and Pico-Banana-400K for image editing tasks. For video editing, they utilize Ditto-1M, OpenVE-3M, and ReCo-Data, while incorporating Koala-36M and MotionBench specifically for pretext motion alignment in text-to-video generation.

  • Subset Filtering and Selection: All data undergoes a VLM-based coarse filtering stage using Qwen2.5-VL-72B to score samples on a 1–10 scale across metrics like Instruction Following, Visual Quality, Content Preservation, and Motion Consistency. The authors apply strict thresholds, retaining image samples with scores of 9 or higher for the first three metrics, and video samples with scores of 8 or higher for most metrics and above 8 for Motion Consistency. Specific subsets are selected, including only the Style category from Ditto-1M and the Local Change, Background, Style, and Subtitles categories from OpenVE-3M.

  • Training Strategy and Mixture Ratios: The model undergoes two-stage training on mixed image and video data at 480p resolution with support for multiple aspect ratios. For text-to-video data, the authors employ a sampling ratio of 1:2:3:4 for no-pretext tasks, Cube Inpainting, Speed Perturbation, and Tube Shuffle respectively. Cube Inpainting uses a 30% masking ratio, Speed Perturbation applies 2x temporal acceleration, and Tube Shuffle divides videos into 2x2x2 spatiotemporal tubes for random shuffling.

  • Processing and Configuration Details: During training, the authors uniformly sample N sparse anchor frames for Semantic Anchoring, setting N to 1 for efficiency, and fix the number of local semantic tokens per anchor frame at 64. They maintain an exponential moving average of model parameters with a decay of 0.9998 and set the loss weight lambda to 0.1. Evaluation is conducted on VIE-Bench, OpenVE-Bench, and ReCo-Bench using different VLM judges such as GPT-4o and Gemini-2.5-Pro for scoring.
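The sampling ratio and the EMA setting described above can be sketched in a few lines. This is a minimal illustration rather than the authors' training code: the names `PRETEXT_WEIGHTS`, `sample_pretext_task`, and `ema_update` are hypothetical, and treating the 1:2:3:4 ratio as per-sample draw weights is an assumption.

```python
import random

# Sampling ratio 1:2:3:4 for text-to-video data, as stated in the summary.
# The dictionary keys are illustrative labels, not identifiers from the paper.
PRETEXT_WEIGHTS = {
    "none": 1,
    "cube_inpainting": 2,
    "speed_perturbation": 3,
    "tube_shuffle": 4,
}


def sample_pretext_task(rng: random.Random) -> str:
    """Draw one pretext task with probability proportional to its weight."""
    names = list(PRETEXT_WEIGHTS)
    weights = [PRETEXT_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def ema_update(ema_params, params, decay=0.9998):
    """Standard exponential moving average: ema <- decay * ema + (1 - decay) * param."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

With these weights, Tube Shuffle is drawn about four times as often as the no-pretext case, and a decay of 0.9998 means each step moves the averaged weights only 0.02% of the way toward the current weights.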

Method

The authors propose SAMA, a framework built upon the Wan2.1-T2V-14B video diffusion transformer. The core idea is to factorize video editing into semantic anchoring and motion modeling, balancing precise semantic modification against faithful motion preservation. The overall architecture and training pipeline are illustrated in the paper's framework diagram.

The method encodes source and target videos into VAE latents, denoted as $\mathbf{z}_s$ and $\mathbf{z}_t$. These are concatenated to form an in-context V2V input $\mathbf{z} = [\mathbf{z}_s; \mathbf{z}_t]$. To distinguish token roles, learned type embeddings are added: type id 0 for source-video latents, type id 2 for target-video latents, and type id 1 for semantic tokens. This approach is observed to yield faster convergence compared to shifted RoPE schemes.
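A small sketch of how type ids might be assigned and added to the token sequence. The function names are hypothetical, tokens are plain Python lists standing in for tensors, and the layout (source latents, then semantic tokens, then target latents) is an assumption based on the semantic tokens being prepended to the target latent sequence.

```python
# Type ids as given in the summary.
TYPE_SOURCE, TYPE_SEMANTIC, TYPE_TARGET = 0, 1, 2


def build_type_ids(num_source, num_semantic, num_target):
    """Type id per token for the assumed layout [source | semantic | target]."""
    return (
        [TYPE_SOURCE] * num_source
        + [TYPE_SEMANTIC] * num_semantic
        + [TYPE_TARGET] * num_target
    )


def add_type_embeddings(tokens, type_ids, type_table):
    """Add the learned embedding of each token's type to that token.

    tokens: list of feature vectors, type_table: one vector per type id.
    """
    return [
        [x + e for x, e in zip(token, type_table[type_id])]
        for token, type_id in zip(tokens, type_ids)
    ]
```

In a real model `type_table` would be a trainable embedding matrix with three rows; here it is any indexable collection of three vectors.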

Semantic Anchoring (SA) establishes reliable visual anchors. For video samples, $N$ frames are uniformly sampled as anchor frames. A SigLIP image encoder extracts patch-level semantic features, which are pooled into local and global tokens. These are projected into the VAE latent space via a lightweight MLP. The projected semantic tokens $\hat{\mathbf{s}}$ are prepended to the target latent sequence. Both semantic tokens and target latents undergo the forward noising process. The model predicts the semantic tokens $\mathbf{s}$ via a head attached to the final DiT layer. The objective minimizes the $\ell_1$ loss $\mathcal{L}_{\mathrm{sem}} = \|\hat{\mathbf{s}} - \mathbf{s}\|_1$. The total loss combines flow matching and semantic anchoring: $\mathcal{L} = \mathcal{L}_{\mathrm{FM}} + \lambda \cdot \mathcal{L}_{\mathrm{sem}}$.
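The combined objective can be written out numerically as follows. This sketch assumes the semantic tokens are flattened into plain lists and that the flow-matching loss is already available as a scalar, since the summary does not spell out its form. The function names are hypothetical, and whether the authors sum or average the absolute differences is not stated; the sum below matches the $\ell_1$ norm as written.

```python
def l1_norm(s_hat, s):
    """L1 distance between projected tokens s_hat and predicted tokens s (flattened)."""
    return sum(abs(a - b) for a, b in zip(s_hat, s))


def total_loss(loss_fm, s_hat, s, lam=0.1):
    """L = L_FM + lambda * L_sem, with lambda = 0.1 as reported in the summary."""
    loss_sem = l1_norm(s_hat, s)
    return loss_fm + lam * loss_sem
```

For example, with a flow-matching loss of 1.0 and token vectors that differ by a total of 3.0, the combined loss is 1.0 + 0.1 × 3.0 = 1.3.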

Motion Alignment (MA) aligns the edited video with source motion dynamics. It applies motion-centric transformations $\mathcal{T}$ only to the source video $V_s$ to create a perturbed version $\tilde{V}_s$, while keeping the target video unchanged. This forces the model to learn motion recovery. The specific pretext perturbations are detailed in the figure below.

The three transformations include Cube Inpainting (masking temporal blocks), Speed Perturbation (accelerating playback), and Tube Shuffle (permuting spatio-temporal tubes). Task tokens are prepended to instructions to unify the formulation (e.g., "[Complete the missing regions in the video.]").
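The three perturbations can be illustrated on a toy video stored as nested lists indexed `[t][h][w]`. This is a sketch under several assumptions, not the authors' implementation: the 4x4x4 block grid for masking, zero as the fill value, frame dropping as the way to accelerate playback, and all function names are illustrative choices. Only the 30% masking ratio, the 2x speed-up, and the 2x2x2 tube grid come from the summary.

```python
import random


def _blocks(T, H, W, grid):
    """Bounds (t0, t1, h0, h1, w0, w1) of each block in a gt x gh x gw grid."""
    gt, gh, gw = grid
    return [
        (i * T // gt, (i + 1) * T // gt,
         j * H // gh, (j + 1) * H // gh,
         k * W // gw, (k + 1) * W // gw)
        for i in range(gt) for j in range(gh) for k in range(gw)
    ]


def speed_perturbation(video, factor=2):
    """Keep every `factor`-th frame, so playback is `factor` times faster."""
    return video[::factor]


def cube_inpainting(video, rng, grid=(4, 4, 4), mask_ratio=0.3, fill=0.0):
    """Overwrite a random ~30% of the spatio-temporal blocks with `fill`."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    out = [[row[:] for row in frame] for frame in video]
    blocks = _blocks(T, H, W, grid)
    num_masked = max(1, round(mask_ratio * len(blocks)))
    for t0, t1, h0, h1, w0, w1 in rng.sample(blocks, num_masked):
        for t in range(t0, t1):
            for h in range(h0, h1):
                for w in range(w0, w1):
                    out[t][h][w] = fill
    return out


def tube_shuffle(video, rng, grid=(2, 2, 2)):
    """Split the video into equal tubes and randomly permute their positions."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    gt, gh, gw = grid
    assert T % gt == 0 and H % gh == 0 and W % gw == 0, "tubes must be equal-sized"
    blocks = _blocks(T, H, W, grid)
    sources = blocks[:]
    rng.shuffle(sources)
    out = [[row[:] for row in frame] for frame in video]
    for (dt0, dt1, dh0, dh1, dw0, dw1), (st0, _, sh0, _, sw0, _) in zip(blocks, sources):
        for t in range(dt1 - dt0):
            for h in range(dh1 - dh0):
                for w in range(dw1 - dw0):
                    out[dt0 + t][dh0 + h][dw0 + w] = video[st0 + t][sh0 + h][sw0 + w]
    return out
```

Each function returns a new video and leaves its input untouched, which mirrors the setup above: only the source video is perturbed, and the unperturbed target serves as the reconstruction goal.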

SAMA utilizes a two-stage pipeline. Stage 0 is Factorized Pre-training, where the model learns inherent semantic-motion representations without paired editing data. SA is applied to both image and video samples, while MA is applied to the video stream. Stage 1 is Supervised Fine-tuning (SFT) on paired video editing datasets. In this stage, the model aligns generation with paired supervision while keeping SA enabled to maintain stable semantic anchoring.

Experiment

  • Comparisons with state-of-the-art methods validate that SAMA achieves superior overall performance on Swap, Change, and Remove tasks, demonstrating stronger instruction adherence, better handling of fine-grained spatial and attribute constraints, and improved temporal consistency compared to existing models.
  • Zero-shot evaluation confirms the model can perform diverse editing tasks without specific training data, though it exhibits limitations such as temporal color inconsistency, blurriness in added objects, and residual ghosting in removal edits.
  • Ablation studies reveal that Semantic Anchoring accelerates model convergence and stabilizes training, while Motion Alignment significantly enhances temporal coherence and reduces motion blur during fast camera movements, with both components proving complementary.
  • Visualization of motion-centric pretext tasks indicates that the model successfully internalizes motion cues and temporal reasoning, which directly supports high-quality instruction-guided video editing.
