GenCompositor: Generative Video Compositing with Diffusion Transformer
Shuzhou Yang Xiaoyu Li Xiaodong Cun Guangzhi Wang Lingen Li Ying Shan Jian Zhang

Abstract
Video compositing combines live-action footage to create video productions, serving as a crucial technique in video creation and film production. Traditional pipelines require intensive labor and expert collaboration, resulting in lengthy production cycles and high manpower costs. To address this issue, we automate this process with generative models, a task we call generative video compositing. This new task strives to adaptively inject the identity and motion information of a foreground video into the target video in an interactive manner, allowing users to customize the size, motion trajectory, and other attributes of the dynamic elements added to the final video. Specifically, we designed a novel Diffusion Transformer (DiT) pipeline based on its intrinsic properties. To maintain consistency of the target video before and after editing, we revised a lightweight DiT-based background preservation branch with masked token injection. To inherit dynamic elements from other sources, a DiT fusion block is proposed using full self-attention, along with a simple yet effective foreground augmentation for training. Besides, to fuse background and foreground videos with different layouts under user control, we developed a novel position embedding, named Extended Rotary Position Embedding (ERoPE). Finally, we curated a dataset comprising 61K sets of videos for our new task, called VideoComp. This data includes complete dynamic elements and high-quality target videos. Experiments demonstrate that our method effectively realizes generative video compositing, outperforming existing possible solutions in fidelity and consistency.
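
The abstract describes ERoPE only at a high level. As a rough illustration, the sketch below shows one plausible 1D reading of "extended" rotary positions: foreground tokens receive position indices offset past the background's, so the concatenated sequence carries no positional conflicts before full self-attention in the fusion block. The function names and the 1D simplification are assumptions for illustration, not the paper's implementation, which operates on multi-axis video token layouts.

```python
# Minimal 1D sketch of an ERoPE-style extension (assumed reading):
# give foreground tokens positions offset past the background's, so the
# two token sets never share rotary positions when concatenated for
# full self-attention in the fusion block.
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE rotation angles for integer positions, shape (seq, dim/2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return positions.float()[:, None] * inv_freq[None, :]

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive feature pairs of x (seq, dim) by angles (seq, dim/2)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Background and foreground token sequences with different lengths/layouts.
dim, n_bg, n_fg = 64, 256, 128
bg = torch.randn(n_bg, dim)
fg = torch.randn(n_fg, dim)

# Vanilla RoPE would start both sequences at position 0; the extension
# instead places foreground positions after the background's.
bg_pos = torch.arange(n_bg)
fg_pos = torch.arange(n_fg) + n_bg  # extended, non-overlapping positions

tokens = torch.cat([apply_rope(bg, rope_angles(bg_pos, dim)),
                    apply_rope(fg, rope_angles(fg_pos, dim))], dim=0)
# `tokens` (n_bg + n_fg, dim) can now feed a full self-attention block
# without the two sources competing for the same positions.
```

The design intuition, under these assumptions, is that disjoint position ranges let attention distinguish which source a token came from while still allowing unrestricted cross-source interaction, which is what full self-attention in the fusion block requires.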