Mode Seeking Meets Mean Seeking for Fast Long Video Generation
Abstract
Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data are abundant and high quality, coherent long-video data are scarce and confined to narrow domains. To address this, we propose a training paradigm that combines mode seeking and mean seeking, decoupling local fidelity from long-horizon consistency on top of a unified representation in a Decoupled Diffusion Transformer. Our approach uses a global flow-matching head, trained with supervised learning on long videos to capture narrative structure, alongside a local distribution-matching head that aligns sliding windows to a frozen short-horizon teacher via mode-seeking reverse Kullback-Leibler divergence. This strategy enables minute-long video generation: long-range consistency and motion are learned from limited long-video resources through supervised flow matching, while local realism is inherited by matching each sliding-window chunk of the student to the frozen short-horizon teacher, yielding a fast, few-step long-video generator. Evaluations show that our method effectively closes the fidelity-horizon gap by jointly improving local sharpness, motion, and long-term consistency. Project page: https://primecai.github.io/mmm/.
One-sentence Summary
Researchers from Stanford and NVIDIA propose a Decoupled Diffusion Transformer that combines supervised flow matching for long-term coherence and reverse-KL alignment to short-video teachers for local realism, enabling fast, minute-scale video generation with improved fidelity and temporal consistency.
Key Contributions
- We introduce a decoupled training paradigm that aligns sliding-window segments of a long-video generator to a frozen short-video teacher via mode-seeking reverse-KL divergence, preserving local fidelity without additional short-video data.
- Our Decoupled Diffusion Transformer uses separate Flow Matching and Distribution Matching heads to jointly learn long-range narrative structure from limited long videos and local realism from the teacher, both decoded from a shared representation.
- By leveraging only the Distribution Matching head at inference, we enable fast few-step generation of minute-scale videos that maintain sharp local motion and long-range consistency, effectively closing the fidelity–horizon gap.
Introduction
The authors leverage a decoupled training paradigm to tackle the challenge of generating minute-scale videos, where long-term coherence and local fidelity are typically at odds due to data scarcity. Prior methods that mix short and long videos assume temporal scaling is like spatial resolution scaling — an analogy the authors debunk, showing that long videos require extrapolating new events and causal structures, not just interpolating frames. Existing approaches either sacrifice local sharpness for longer duration or rely on expensive, scarce long-video datasets. Their key contribution is a Decoupled Diffusion Transformer with two heads: a Flow Matching head trained on real long videos to learn global narrative structure, and a Distribution Matching head that aligns sliding-window segments to a frozen short-video teacher using mode-seeking reverse-KL divergence — enabling fast, few-step inference while preserving both local realism and long-range consistency.
Dataset

- The authors use data from multiple sources: all videos from the Sekai dataset and a filtered subset from MiraData, plus internet videos with single-shot filtering.
- The combined dataset spans over 100k videos, each 10 seconds to several minutes long, averaging 31 seconds per clip.
- Videos longer than 61 seconds are temporally subsampled to meet the upper bound.
- The data is used as-is for training, with no mention of mixture ratios or additional splits; preprocessing is limited to capping clip length via temporal subsampling.
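The paper only states that clips longer than 61 seconds are temporally subsampled to the cap; it does not give the sampling scheme. A minimal sketch, assuming uniform frame selection (the helper name and signature are hypothetical):

```python
def subsample_to_cap(num_frames: int, fps: float, cap_seconds: float = 61.0) -> list:
    """Return frame indices so the selected frames fit within `cap_seconds`
    at the original fps. Uniform spacing is an assumption; the paper only
    says over-length clips are temporally subsampled to the upper bound."""
    max_frames = int(cap_seconds * fps)
    if num_frames <= max_frames:
        return list(range(num_frames))  # already within the cap
    # evenly spaced indices spanning the whole clip
    step = num_frames / max_frames
    return [int(i * step) for i in range(max_frames)]
```

Clips at or below the cap pass through unchanged; longer clips keep a fixed frame budget regardless of their original duration.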
Method
The authors leverage a decoupled architecture to reconcile the competing objectives of long-horizon coherence and local realism in video generation. The core design centers on a shared condition encoder that processes noisy long-video latents and feeds two distinct decoder heads, each optimized for a separate training signal. This structure enables the model to simultaneously learn global temporal structure from scarce long videos and preserve high-fidelity local dynamics via alignment with a short-video teacher.
The condition encoder E_φ ingests a noisy long-video latent x_t^long, along with the timestep t and conditioning c, to produce a spatiotemporal feature tensor h_t. Architecturally, E_φ is implemented as a video diffusion transformer with full-range temporal attention, forming the backbone that both heads share. This shared representation ensures that long-context features are learned once and reused across objectives, promoting consistency between global and local modeling.
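The shared-encoder, two-head layout can be sketched with toy linear maps standing in for the transformer blocks. Everything here (dimensions, the single-layer encoder, the class name) is an illustrative assumption, not the paper's architecture:

```python
import math
import random

random.seed(0)

def randmat(rows, cols, scale=0.1):
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

def matmul(x, w):  # x: (n, d_in), w: (d_in, d_out)
    return [[sum(xi[k] * w[k][j] for k in range(len(w)))
             for j in range(len(w[0]))] for xi in x]

class DecoupledDiT:
    """Toy sketch of the decoupled layout: a shared encoder E_phi producing
    features h_t, plus two linear 'heads' decoding velocities from the same
    representation. Shapes and layers are stand-ins for the paper's
    transformer backbone and lightweight decoder heads."""
    def __init__(self, d_latent=8, d_hidden=16):
        self.w_enc = randmat(d_latent + 2, d_hidden)  # encoder E_phi
        self.w_fm = randmat(d_hidden, d_latent)       # Flow Matching head
        self.w_dm = randmat(d_hidden, d_latent)       # Distribution Matching head

    def encode(self, x_t, t, c):
        inp = [row + [t, c] for row in x_t]           # concat timestep + condition
        return [[math.tanh(v) for v in row] for row in matmul(inp, self.w_enc)]

    def forward(self, x_t, t, c):
        h = self.encode(x_t, t, c)                    # shared features h_t
        return matmul(h, self.w_fm), matmul(h, self.w_dm)  # (u_theta, v_psi)
```

The key structural point survives even in this toy form: both velocity fields are decoded from the same h_t, so long-context features feed both objectives.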
On top of h_t, two lightweight transformer decoders are attached. The first, D_θ^FM, parameterizes the Flow Matching (FM) head, which outputs the global velocity field u_θ(x_t^long, t, c). This head is trained via supervised flow matching on real long videos, minimizing the mean-squared error between the predicted velocity and the ground-truth marginal velocity x_0^long − z^long. This objective anchors the model to real long-video trajectories, encouraging minute-scale temporal coherence and narrative structure.
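The FM objective is standard rectified-flow supervision: interpolate between noise z and data x_0, and regress the velocity onto x_0 − z. A minimal sketch on flat vectors (the interpolant convention is the usual one implied by the x_0 − z target, and the helper names are ours):

```python
def interpolant(x0, z, t):
    """x_t = (1 - t) * z + t * x0, whose time-derivative is x0 - z."""
    return [(1 - t) * zi + t * xi for xi, zi in zip(x0, z)]

def flow_matching_loss(u_pred, x0, z):
    """MSE between the predicted velocity and the target x0 - z."""
    target = [xi - zi for xi, zi in zip(x0, z)]
    return sum((u - tgt) ** 2 for u, tgt in zip(u_pred, target)) / len(u_pred)
```

A head that exactly predicts x_0 − z drives this loss to zero, which is what "anchoring to real long-video trajectories" means operationally.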
The second head, D_ψ^DM, implements the Distribution Matching (DM) objective. It outputs a local velocity field v_ψ(x_t^long, t, c), which is used to generate sliding-window segments. These segments are aligned with an expert short-video teacher via a reverse-KL loss, implemented through a DMD/VSD-style gradient surrogate. Specifically, for each window k, the model crops the predicted velocity and compares it against the teacher's velocity on the corresponding noised window. The gradient is computed as the difference between the student's "fake" score estimator and the teacher's velocity, scaled by a time-dependent weight λ(t), and backpropagated only through the generated window x̂_0^(k). This mode-seeking signal encourages the student to concentrate on high-fidelity local modes of the teacher, preserving short-horizon realism without requiring access to the teacher's training data.
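The window cropping and the per-window gradient surrogate can be sketched as follows. The window size, stride, and weighting λ(t) = 1 − t are illustrative assumptions; in the paper the "fake" score comes from a separately trained estimator, which a second velocity prediction stands in for here:

```python
def sliding_windows(frames, win, stride):
    """Crop overlapping windows along the time axis of a frame sequence."""
    return [frames[k:k + win] for k in range(0, len(frames) - win + 1, stride)]

def dmd_window_gradient(v_fake, v_teacher, t, lam=lambda t: 1.0 - t):
    """DMD/VSD-style surrogate for one window: grad ~ lambda(t) * (s_fake - s_teacher),
    applied only to the generated window x_hat_0^(k). Velocity predictions stand
    in for the score estimates; lambda(t) is an assumed weighting."""
    w = lam(t)
    return [w * (f - s) for f, s in zip(v_fake, v_teacher)]
```

When the student's fake score matches the teacher's velocity on a window, the surrogate gradient vanishes, i.e. that window already sits on a teacher mode.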
Refer to the framework diagram, which illustrates how the shared encoder E_φ feeds both the mean-seeking FM head and the mode-seeking DM head. The FM head is supervised by ground-truth long videos, while the DM head is regularized by the short-video teacher through sliding-window comparisons. This decoupling allows each head to specialize: the FM head learns global dynamics, and the DM head refines local quality.

Experiment
- Validated that SFT-only methods (LongSFT, MixSFT) establish basic temporal coherence but suffer from blurriness and loss of fine detail due to data scarcity and averaging effects.
- Confirmed that teacher-only methods (CausVid, Self-Forcing, InfinityRoPE) preserve local realism initially but degrade over time due to error accumulation and lack of long-context grounding, often resulting in static or overly conservative motion.
- Demonstrated that the proposed method outperforms baselines by decoupling global long-context learning (via SFT) from local fidelity alignment (via teacher distribution matching), achieving superior motion smoothness, scene consistency, and visual quality.
- Ablation studies confirmed the necessity of the dual-head architecture, sliding-window teacher matching, and long-video SFT — each component uniquely contributes to long-horizon coherence and short-term realism without gradient interference.
- Qualitative and quantitative evaluations across multiple metrics and Gemini-3-Pro assessments consistently support the method’s ability to generate temporally coherent, visually rich, and narratively stable long videos.
The authors use a decoupled dual-head architecture to separately handle long-range consistency and short-range fidelity, combining supervised fine-tuning on real long videos with sliding-window teacher distribution matching. Results show that removing any of these components degrades performance, confirming that both global structure learning and local texture preservation are essential for high-quality long video generation. The full model achieves the best balance across consistency, motion, and quality, outperforming ablated variants and baseline methods.

