SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing Model
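Abstract
SkyReels V4 is a unified multi-modal video foundation model for joint video-audio generation, inpainting, and editing. The model adopts a dual-stream Multimodal Diffusion Transformer (MMDiT) architecture in which one branch synthesizes video and the other generates temporally aligned audio, and the two branches share a powerful text encoder based on a Multimodal Large Language Model (MMLM). SkyReels V4 accepts rich multi-modal instructions, including text, images, video clips, masks, and audio references. By combining the MMLM's multi-modal instruction-following capability with in-context learning in the video-branch MMDiT, the model can inject fine-grained visual guidance under complex conditioning, while the audio-branch MMDiT uses audio references to guide sound generation. On the video side, a channel-concatenation formulation unifies a wide range of inpainting-style tasks, such as image-to-video generation, video extension, and video editing under various conditions, in a single interface, and extends naturally to vision-referenced inpainting and editing through multi-modal prompts. SkyReels V4 supports up to 1080p resolution, 32 FPS, and 15-second duration, enabling high-fidelity, multi-shot, cinema-level video generation with synchronized audio. To make such high-resolution, long-duration generation computationally feasible, we introduce an efficiency strategy: joint generation of low-resolution full sequences and high-resolution keyframes, followed by dedicated super-resolution and frame-interpolation models. To our knowledge, SkyReels V4 is the first video foundation model that simultaneously supports multi-modal input, joint video-audio generation, and a unified treatment of generation, inpainting, and editing, while maintaining strong efficiency and quality at cinematic resolutions and durations.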
One-sentence Summary
SkyReels Team and Skywork AI propose SkyReels-V4, a unified multi-modal video foundation model using dual-stream MMDiT architecture with shared MMLM text encoder, enabling joint video-audio generation, editing, and inpainting at 1080p/32fps/15s via efficient low-high resolution generation, setting a new standard for cinematic multi-modal content creation.
Key Contributions
- SkyReels-V4 introduces a dual-stream MMDiT architecture that jointly generates synchronized video and audio from diverse inputs—including text, images, video clips, masks, and audio references—by leveraging a shared MMLM text encoder for unified multi-modal conditioning.
- The model unifies generation, inpainting, and editing via a channel-concatenation formulation in the video branch, enabling tasks like image-to-video, video extension, and mask-guided editing under a single interface, while the audio branch uses reference audio to guide sound synthesis.
- It achieves cinematic quality (1080p, 32 FPS, 15s) through an efficiency strategy of generating low-res sequences and high-res keyframes followed by super-resolution and interpolation, and outperforms state-of-the-art models on benchmarks including SkyReels-VABench and Artificial Analysis Arena.
Introduction
The authors leverage recent advances in multimodal diffusion modeling to address the fragmentation in video-audio generation systems, where prior models either handled modalities separately or lacked unified editing and inpainting capabilities. Existing approaches—whether commercial like Sora-2 or open-source like Kling-Omni—struggle with full audio-visual alignment, multimodal conditioning, or scalable editing under a single architecture, often sacrificing synchronization, resolution, or flexibility. SkyReels-V4 introduces a dual-stream MMDiT framework that jointly generates synchronized video and audio from diverse inputs (text, images, video, masks, audio) while unifying generation, inpainting, and editing through a channel-concatenation paradigm. It further enables cinematic-scale outputs (1080p, 32 FPS, 15s) via an efficient low/high-resolution keyframe strategy, making it the first system to integrate all these capabilities at production-grade quality and speed.
Dataset
- The authors use a multimodal training dataset combining real-world and synthetic data across images, videos, and audio.
- Real-world data comes from public sources (LAION, Flickr, WebVid-10M, Koala-36M, OpenHumanVid, Emilia, AudioSet, VGGSound, SoundNet) and licensed in-house content (movies, TV series, short videos, web series).
- Synthetic data fills gaps in multilingual text generation, speech synthesis, and multimodal editing. Text generation covers Chinese, English, Japanese, Korean, German, French, etc., with font-aware rendering and context-aware styling. Video-text data includes motion-matched text effects. Speech data uses multiple TTS models and rare-script corpora. Inpainting/editing data is built via segmentation, editing, and controllable generation pipelines.
- Image processing includes deduplication, quality filtering (resolution, IQA, watermarks), and balancing via clustering (pretraining) or entity/scene matching (fine-tuning).
- Audio processing classifies clips into sound effects, music, speech, or singing using Qwen3-Omni; filters by SNR, MOS, clipping, and silence ratio; segments or concatenates clips to 15 seconds; transcribes speech/singing with Whisper; and generates unified captions via Qwen3-Omni.
- Video processing uses intelligent segmentation (VLM-enhanced TransNet) for narrative coherence, deduplicates via VideoCLIP, filters by basic, content, and motion quality, balances by concept and motion taxonomy, and syncs audio-video via SyncNet (retaining clips with |offset| ≤ 3 and confidence > 1.5, min volume -60 dB).
- The audio backbone is pretrained from scratch on hundreds of thousands of hours of variable-length speech (up to 15s) to capture speaker traits like pitch and emotion.
- In supervised fine-tuning, the authors train on 5M multimodal joint generation videos (20% of data), then refine with 1M manually curated high-quality videos to boost motion coherence and audio-visual alignment.
- For evaluation, they introduce SkyReels-VABench: a 2000+ prompt benchmark testing text-to-video+audio models across languages (Chinese, English), content types (advertising, education, storytelling), subjects, environments, motion dynamics, and audio modalities (speech, singing, SFX, music).
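The audio-video synchronization filter from the video processing step can be sketched as a simple predicate. This is only an illustration of the stated thresholds (|offset| ≤ 3, confidence > 1.5, minimum volume -60 dB): the function name, argument names, and sample clips are hypothetical, the offset unit is not given in the source, and reading "min volume -60 dB" as "keep clips at or above -60 dB" is an assumption.

```python
def keep_clip(offset, confidence, volume_db,
              max_abs_offset=3, min_confidence=1.5, min_volume_db=-60.0):
    """Return True if a clip passes the sync filter described in the text.

    offset and confidence are assumed to be precomputed SyncNet outputs;
    volume_db is assumed to be the clip's measured volume in dB.
    """
    return (
        abs(offset) <= max_abs_offset        # |offset| <= 3
        and confidence > min_confidence      # strictly greater than 1.5
        and volume_db >= min_volume_db       # assumed reading of "min volume -60 dB"
    )


# Hypothetical clip statistics, made up purely for illustration.
clips = [
    {"id": "a", "offset": 1, "confidence": 4.2, "volume_db": -23.0},
    {"id": "b", "offset": -5, "confidence": 6.0, "volume_db": -20.0},
    {"id": "c", "offset": 0, "confidence": 1.5, "volume_db": -18.0},
    {"id": "d", "offset": -3, "confidence": 2.0, "volume_db": -72.0},
    {"id": "e", "offset": -3, "confidence": 2.0, "volume_db": -60.0},
]

kept = [c["id"] for c in clips
        if keep_clip(c["offset"], c["confidence"], c["volume_db"])]
```

Keeping the thresholds as keyword arguments makes it easy to tighten or relax the filter without touching the predicate itself.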
Method
The authors leverage a dual-stream Multimodal Diffusion Transformer (MMDiT) architecture to enable joint video and audio generation, inpainting, and editing under a unified framework. The model processes video and audio modalities through parallel, symmetric branches that share a common text encoder derived from a Multimodal Large Language Model (MMLM). This design allows the system to accept rich multi-modal conditioning signals—including text, images, video clips, masks, and audio references—while maintaining computational efficiency at cinematic resolutions and durations.
Refer to the framework diagram, which illustrates the overall architecture. The input pipeline begins with multi-modal conditioning: visual references (images or video clips) are encoded via a Video-VAE, while audio references are processed through an Audio-VAE. These are combined with noisy latents and spatial-temporal masks via channel concatenation for the video branch, and with text embeddings from the MMLM encoder for both branches. The MMLM encoder produces a unified semantic context that is consumed independently by both video and audio streams through self-attention and cross-attention mechanisms.
Each transformer block in the video and audio branches follows a hybrid design: the initial M layers employ a Dual-Stream configuration where video/audio and text tokens maintain separate parameters for normalization and projections but interact during joint self-attention. This facilitates strong cross-modal alignment early in the network. The subsequent N layers transition to a Single-Stream architecture that processes concatenated tokens with shared parameters, maximizing computational efficiency. To counteract potential semantic dilution in the single-stream stages, the video branch incorporates an additional text cross-attention layer after self-attention, reinforcing textual guidance throughout generation.
Bidirectional cross-attention between video and audio streams is embedded in every transformer block, enabling continuous temporal synchronization. The audio stream attends to video features, and vice versa, ensuring that generated audio remains temporally aligned with visual content. Because the two streams have different temporal resolutions (21 video frames versus 218 audio tokens), the authors apply Rotary Positional Embeddings (RoPE) to the audio tokens with a frequency scaling factor of 21/218 ≈ 0.09633, aligning their temporal structure with the video stream.
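The effect of the scaling factor can be illustrated with a small sketch. It assumes the conventional RoPE angle formula (position times base^(-2k/d)) with the common base of 10000. The source states neither the formula nor the base, and the function names and the tiny head dimension are hypothetical.

```python
from typing import List


def audio_rope_scale(num_video_frames: int = 21, num_audio_tokens: int = 218) -> float:
    """Frequency scaling factor that maps audio token indices onto the video time axis."""
    return num_video_frames / num_audio_tokens


def rope_angles(position: float, head_dim: int,
                base: float = 10000.0, scale: float = 1.0) -> List[float]:
    """Rotation angles for one token: scale * position * base^(-2k / head_dim).

    base=10000 is the conventional RoPE default and is an assumption here.
    """
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    return [scale * position * base ** (-2.0 * k / head_dim)
            for k in range(head_dim // 2)]


scale = audio_rope_scale()

# Scaling the frequencies of the last audio token (index 217) is equivalent
# to placing it at fractional video position 217 * scale, which stays
# inside the 21-frame video time axis.
last_audio = rope_angles(217, head_dim=8, scale=scale)
equivalent_video = rope_angles(217 * scale, head_dim=8)
```

Under this reading, scaling the audio frequencies by 21/218 is the same as re-indexing audio tokens in units of video frames, so tokens from both streams that occur at the same moment receive matching rotary phases.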
Training proceeds under a flow matching objective, where the model predicts the velocity field that guides noisy latents toward clean data. The loss function jointly optimizes both video and audio branches:
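$$
\mathcal{L}_{\text{flow}} = \mathbb{E}_{t,\, z_v^0,\, z_a^0,\, \epsilon_v,\, \epsilon_a}\left[ \left\| v_\theta^v(t, Z_v^t, Z_a^t, c) - (z_v^0 - \epsilon_v) \right\|^2 + \left\| v_\theta^a(t, Z_a^t, Z_v^t, c) - (z_a^0 - \epsilon_a) \right\|^2 \right]
$$
where $c$ includes multi-modal embeddings and optional masks.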
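A single-sample estimate of this objective can be sketched in NumPy as follows. The target velocities (clean latent minus noise) come from the formula above. The linear interpolation path used to build the noisy latents is an assumption chosen so that its time derivative equals that target, and all function names and shapes are hypothetical.

```python
import numpy as np


def interpolate(z0, eps, t):
    """Assumed linear path: t=0 is pure noise, t=1 is clean data.

    Its derivative with respect to t is (z0 - eps), matching the target
    velocity in the loss. The source does not state the path explicitly.
    """
    return t * z0 + (1.0 - t) * eps


def flow_matching_loss(pred_v, pred_a, zv0, za0, eps_v, eps_a):
    """Sum of squared errors between predicted and target velocities
    for the video and audio branches (one Monte Carlo sample)."""
    target_v = zv0 - eps_v
    target_a = za0 - eps_a
    loss_v = np.sum((pred_v - target_v) ** 2)
    loss_a = np.sum((pred_a - target_a) ** 2)
    return float(loss_v + loss_a)


rng = np.random.default_rng(0)
zv0, eps_v = rng.standard_normal((2, 3)), rng.standard_normal((2, 3))
za0, eps_a = rng.standard_normal((5,)), rng.standard_normal((5,))

# Noisy latents that the two branches would receive at time t.
t = 0.3
zv_t, za_t = interpolate(zv0, eps_v, t), interpolate(za0, eps_a, t)

# A perfect velocity predictor attains zero loss.
perfect = flow_matching_loss(zv0 - eps_v, za0 - eps_a, zv0, za0, eps_v, eps_a)
```

In training, `pred_v` and `pred_a` would come from the video and audio branches, each conditioned on both noisy latents and on `c`. Here they are passed in directly to keep the sketch independent of any model code.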
For video inpainting and editing, the authors adopt a channel-concatenation formulation that unifies diverse tasks—including text-to-video, image-to-video, video extension, and spatiotemporal editing—under a single interface. The input to the video MMDiT is formed as:
$$
Z_{\text{input}} = \mathrm{Concat}(V, I, M)
$$
where $V$ is the noisy video latent, $I$ contains VAE-encoded conditional frames, and $M$ is a binary mask indicating regions to be generated (0) or preserved (1). This mechanism is applied exclusively to the video stream; the audio branch generates synchronized audio from scratch conditioned on the (partially edited) video content.
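As a rough illustration of the channel-concatenation input, the NumPy sketch below builds the image-to-video case, where only the first latent frame is preserved. The `(channels, frames, height, width)` layout, the latent sizes, and the function names are assumptions for illustration. Only the mask convention (0 = generate, 1 = preserve) and the concatenation of V, I, and M come from the text.

```python
import numpy as np


def image_to_video_mask(num_frames, height, width):
    """Binary mask M: 1 = preserve, 0 = generate. Only frame 0 is preserved."""
    mask = np.zeros((1, num_frames, height, width), dtype=np.float32)
    mask[:, 0] = 1.0
    return mask


def build_video_input(noisy_latent, cond_latent, mask):
    """Concat(V, I, M) along the channel axis (axis 0 in the assumed layout)."""
    if noisy_latent.shape != cond_latent.shape:
        raise ValueError("V and I must have the same shape")
    if mask.shape[1:] != noisy_latent.shape[1:]:
        raise ValueError("M must match V in frames, height, and width")
    return np.concatenate([noisy_latent, cond_latent, mask], axis=0)


rng = np.random.default_rng(0)
C, T, H, W = 4, 21, 8, 8  # hypothetical latent shape

noisy = rng.standard_normal((C, T, H, W)).astype(np.float32)   # V
cond = np.zeros((C, T, H, W), dtype=np.float32)                # I
cond[:, :1] = rng.standard_normal((C, 1, H, W))                # encoded first frame
mask = image_to_video_mask(T, H, W)                            # M

z_input = build_video_input(noisy, cond, mask)  # shape (2C + 1, T, H, W)
```

Other tasks would differ only in `cond` and `mask`: an all-zero mask corresponds to text-to-video, a mask preserving a leading run of frames to video extension, and a spatially varying mask to localized editing.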
To achieve high-resolution, long-duration generation efficiently, the authors introduce a cascaded Refiner module that performs joint video super-resolution and frame interpolation. As shown in the figure below, the Refiner accepts low-resolution full sequences and high-resolution keyframes from the base model, along with multi-modal conditioning signals. It employs Video Sparse Attention (VSA) to reduce computational cost by approximately 3× while preserving quality, enabling practical inference at 1080p and 32 FPS.
The Refiner is initialized from the pre-trained video generation model and trained under the same flow matching paradigm. It supports both unconditional enhancement and conditional inpainting via a spatial mask that guides refinement only in target regions. This design enables the model to handle complex editing scenarios—including watermark removal, subject manipulation, and global style transfer—while maintaining temporal coherence and acoustic synchronization.
Experiment
- Ranked third on the Artificial Analysis public leaderboard for text-to-video with audio, indicating strong user-preferred audiovisual synthesis among top models.
- Achieved highest overall score in human evaluations across five dimensions: Instruction Following, Audio-Visual Synchronization, Visual Quality, Motion Quality, and Audio Quality, with standout performance in Instruction Following and Motion Quality.
- Outperformed major baselines (Veo 3.1, Kling 2.6, Seedance 1.5 Pro, Wan 2.6) in pairwise comparisons, consistently rated “Good” more often across most evaluation dimensions.
- Demonstrated practical multimodal editing capabilities including subject insertion, attribute modification, background replacement, and reference-guided synthesis, validated through real-world application examples.