
Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

Luozheng Qin Jia Gong Qian Qiao Tianjiao Li Li Xu Haoyu Pan Chao Qu Zhiyu Tan Hao Li

Abstract

Unified multimodal models for visual understanding and generation face a fundamental challenge: visual generation is far more computationally expensive than understanding, especially in the video domain. This imbalance motivates inverting the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as its foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. In addition, we propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers to support text generation while preserving generative priors. To repurpose generative knowledge for understanding, we design a two-stage bidirectional training mechanism: a Knowledge Recall stage that reconstructs the input prompt to exploit learned text-video correspondences, and a Capability Refinement stage that fine-tunes on detailed captions to build discriminative shared representations. Experiments show that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project page and code: https://fr0zencrane.github.io/uni-vigu-page/

One-sentence Summary

Uni-ViGU unifies video generation and understanding by extending a diffusion-based video generator through a unified flow matching method and a modality-driven MoE-based architecture, utilizing a two-stage bidirectional training mechanism to repurpose generative priors for discriminative understanding.

Key Contributions

  • The paper introduces Uni-ViGU, a framework that unifies video generation and understanding by extending a pretrained video generator as a foundation to leverage existing spatiotemporal priors.
  • A unified flow formulation is presented that enables coherent multimodal generation by performing continuous flow matching for video and discrete flow matching for text within a single process.
  • The work implements a modality-driven Mixture-of-Experts (MoE) architecture and a bidirectional training mechanism consisting of Knowledge Recall and Capability Refinement to repurpose generative knowledge for discriminative video understanding.

Introduction

Integrating visual understanding and generation into a single model is essential for developing general-purpose visual intelligence. Current approaches typically extend understanding-centric multimodal large language models to support generation, but this faces severe scalability issues because video generation requires processing millions of tokens through iterative denoising. The authors propose Uni-ViGU, a framework that inverts this paradigm by using a video generator as the foundational architecture. They introduce a unified flow method that combines continuous flow matching for video with discrete flow matching for text within a single process. To enable this, the authors leverage a modality-driven MoE-based architecture that augments Transformer blocks with lightweight layers for text while preserving generative priors, alongside a bidirectional training mechanism that repurposes learned text-to-video correspondences for video understanding.

Dataset

Dataset overview

The authors utilize a meticulously curated dataset of synthesized video-text pairs to train Uni-ViGU through a two-stage bidirectional framework. The dataset details are as follows:

  • Dataset Composition and Sources: The data is synthesized by using state-of-the-art video generators to create videos from a set of initial conditioning prompts. An LLM is then used to analyze each video-prompt pair to generate highly detailed captions that enrich the original prompt's information.
  • Subsets and Training Usage:
    • Stage 1 (Knowledge Recall): The model is trained on 10K video-prompt pairs. In this stage, the target text is identical to the conditioning prompt, though condition dropout is applied to prevent the model from simply copying the input.
    • Stage 2 (Capability Refinement): The model undergoes fine-tuning on an additional 10K video-prompt-detailed caption triples. Here, the model is conditioned on a brief prompt but tasked with generating a semantically precise, detailed caption.
  • Processing and Constraints: To ensure the model develops genuine comprehension rather than trivial inference, the authors enforce strict token-length constraints. Conditioning prompts are limited to 0 to 128 tokens, while detailed captions are restricted to 128 to 256 tokens. This length separation forces the model to rely on the shared attention mechanism to bridge the gap between the brief prompt and the rich description.
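The length separation above can be sketched as a simple filtering step. This is a hypothetical illustration, not the authors' pipeline: the helper names and the whitespace token counter are assumptions standing in for a real tokenizer.

```python
# Hypothetical sketch of the token-length bands described above.
def within_length_band(n_tokens: int, lo: int, hi: int) -> bool:
    """Check that a token count falls in the half-open band (lo, hi]."""
    return lo < n_tokens <= hi

def filter_pairs(pairs, count_tokens):
    """Keep samples whose prompt is in (0, 128] tokens and whose
    detailed caption is in (128, 256] tokens."""
    kept = []
    for prompt, caption in pairs:
        if (within_length_band(count_tokens(prompt), 0, 128)
                and within_length_band(count_tokens(caption), 128, 256)):
            kept.append((prompt, caption))
    return kept

# Toy token counter: whitespace split stands in for a real tokenizer.
count = lambda s: len(s.split())
data = [("a short prompt", "x " * 200),  # 3-token prompt, 200-token caption
        ("p " * 300, "c " * 200)]        # prompt too long: filtered out
print(len(filter_pairs(data, count)))  # → 1
```

Because the bands do not overlap, no text can satisfy both roles, which is what forces the model to bridge the gap through attention rather than copying.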

Method

The authors leverage the latent diffusion framework of WAN2.1, a state-of-the-art text-to-video generator, as the foundation for their unified model. This framework operates in a compressed latent space, enabling efficient video generation through iterative denoising. The process begins with a video $x$ being encoded into a latent representation $z_1 = \mathcal{E}(x)$ by a Variational Autoencoder (VAE). The model learns a diffusion process by defining a continuous transport path from Gaussian noise $z_0$ to the data latent $z_1$ via linear interpolation, $z_t = (1-t)z_0 + t z_1$. A neural network, specifically a Diffusion Transformer (DiT), is trained to predict the velocity field $u = z_1 - z_0$ conditioned on the text prompt $c$, the intermediate latent $z_t$, and the time step $t$, optimizing a flow matching loss. Inference proceeds by integrating this learned velocity field from $t=0$ to $t=1$ to generate the final latent, which is then decoded into the output video $\hat{x} = \mathcal{D}(z_1)$.
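The flow-matching objective above can be sketched in a few lines. This is a minimal toy example, not the authors' implementation: random tensors stand in for VAE latents, and a fixed linear map stands in for the DiT.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: z1 plays the role of a VAE-encoded data latent,
# z0 is Gaussian noise, t is a per-sample time step in [0, 1].
z1 = rng.standard_normal((4, 16))   # batch of 4 latents, dim 16
z0 = rng.standard_normal((4, 16))
t = rng.uniform(0.0, 1.0, size=(4, 1))

# Linear interpolation path z_t = (1 - t) z0 + t z1 and its velocity target.
zt = (1.0 - t) * z0 + t * z1
u_target = z1 - z0

def toy_velocity_model(zt, t):
    """Placeholder for the DiT: an identity map, for illustration only."""
    return zt @ np.eye(zt.shape[1])

# Flow-matching loss: mean squared error between predicted and true velocity.
u_pred = toy_velocity_model(zt, t)
fm_loss = np.mean((u_pred - u_target) ** 2)
print(fm_loss >= 0.0)  # → True: the loss is a non-negative scalar
```

A real training step would backpropagate this loss into the DiT; here the point is only the shape of the objective: interpolate, predict the velocity, regress against $z_1 - z_0$.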

WAN DiT Block and Uni-ViGU DiT Block

The core architecture of the video generator is a DiT, composed of multiple transformer blocks. Each block processes the input through a sequence of layers: self-attention, cross-attention, and a feed-forward network (FFN). The self-attention layer captures spatial and temporal dependencies within the video features, while the cross-attention layer integrates semantic information from the text prompt $c$, which is used as the key-value pair. The FFN layer performs position-wise transformations. This structure is extended to support a unified text-video generation framework, as shown in the figure below.
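The self-attention / cross-attention / FFN sequence can be sketched as follows. This is a simplified illustration under stated assumptions: projections are random matrices rather than learned weights, and normalization and time-step modulation present in real DiT blocks are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def dit_block(video_tokens, text_cond, rng):
    """One simplified WAN-style DiT block: self-attn -> cross-attn -> FFN.
    Random projections stand in for learned weights (an assumption)."""
    d = video_tokens.shape[-1]
    proj = lambda: rng.standard_normal((d, d)) / np.sqrt(d)
    # Self-attention over video tokens: spatial/temporal dependencies.
    h = video_tokens + attention(video_tokens @ proj(),
                                 video_tokens @ proj(),
                                 video_tokens @ proj())
    # Cross-attention: the text condition supplies keys and values.
    h = h + attention(h @ proj(), text_cond @ proj(), text_cond @ proj())
    # Position-wise feed-forward network.
    return h + np.maximum(h @ proj(), 0.0) @ proj()

rng = np.random.default_rng(0)
out = dit_block(rng.standard_normal((8, 32)), rng.standard_normal((5, 32)), rng)
print(out.shape)  # → (8, 32): same shape as the input video tokens
```

Note how only the cross-attention layer sees the text: in the plain WAN block, text acts purely as conditioning, which is exactly what Uni-ViGU's extension changes.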

Unified Flow Matching

To unify video and text generation, the authors propose a novel uni-flow process that models both modalities within a single generative framework. For video, the continuous flow matching formulation remains, operating in the latent space. For text, a discrete flow matching approach is adapted, where text tokens are mapped to continuous embeddings via a learnable matrix $E$. The model learns to predict the velocity field $u_t = z_{t,1} - z_{t,0}$ in this embedding space. Crucially, the two modalities are jointly learned in a single Transformer backbone. The key innovation lies in the modality-driven Mixture-of-Experts (MoE) architecture, which shares the attention layers to preserve cross-modal alignment while employing modality-specific FFN branches to capture domain-specific knowledge. The attention mechanism operates over the concatenated sequence of video and text tokens, enabling bidirectional cross-modal interaction. The resulting representations are then routed to modality-specific experts, $\mathrm{FFN}_v$ and $\mathrm{FFN}_t$, ensuring that the shared attention patterns learned during pretraining are fully utilized while the FFN layers can specialize for their respective modalities. This design allows for efficient knowledge transfer from the pretrained video generator to the text generation task.
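The modality-driven routing can be sketched as deterministic dispatch on a modality mask; unlike a learned-router MoE, each token's expert is fixed by its modality. A minimal sketch, assuming toy dimensions and random FFN weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

def make_ffn(rng, d):
    """A simple two-layer ReLU FFN with fixed random weights (toy stand-in)."""
    w1 = rng.standard_normal((d, 2 * d)) / np.sqrt(d)
    w2 = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
    return lambda x: np.maximum(x @ w1, 0.0) @ w2

ffn_v = make_ffn(rng, d)  # video expert
ffn_t = make_ffn(rng, d)  # text expert

video_tokens = rng.standard_normal((8, d))
text_tokens = rng.standard_normal((5, d))

# Shared attention would operate on the concatenated sequence; here we just
# concatenate and keep a boolean mask recording each token's modality.
seq = np.concatenate([video_tokens, text_tokens], axis=0)
is_video = np.array([True] * 8 + [False] * 5)

# Modality-driven routing: each token goes to its own modality's FFN expert.
out = np.empty_like(seq)
out[is_video] = ffn_v(seq[is_video])
out[~is_video] = ffn_t(seq[~is_video])
print(out.shape)  # → (13, 32)
```

Because the routing is determined entirely by modality, no router network or load-balancing loss is needed, and the pretrained attention weights can remain shared across both branches.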

The training procedure consists of a two-stage bidirectional framework to effectively transfer and refine capabilities. The first stage, Knowledge Recall, initializes the model with a pretrained video generator and trains it to learn the reverse mapping from video to text. To prevent shortcut learning, the conditioning prompt is dropped with a certain probability, forcing the model to recover the text from the noisy video latent. The second stage, Capability Refinement, replaces the target text with detailed video captions, compelling the text generation branch to attend to the video latent to recover fine-grained visual details, thereby developing genuine video understanding. Inference is symmetric: for video generation, the model denoises the video latent from noise, guided by the text prompt; for video understanding, it denoises the text latent from noise, guided by the clean video. For joint generation, both modalities are initialized from noise and denoised in parallel, with their flows coupled through shared attention, allowing for mutual refinement.
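The symmetric inference described above amounts to integrating the learned velocity field from $t=0$ to $t=1$ for whichever modality starts from noise. A minimal sketch with fixed-step Euler integration, using the true constant velocity $u = z_1 - z_0$ in place of the DiT (an assumption for illustration):

```python
import numpy as np

def euler_sample(velocity_fn, z0, n_steps=10):
    """Integrate dz/dt = v(z, t) from t=0 to t=1 with fixed-step Euler,
    as in flow-matching inference; velocity_fn stands in for the DiT."""
    z, dt = z0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        z = z + dt * velocity_fn(z, t)
    return z

# With the true constant velocity u = z1 - z0, Euler recovers z1 exactly,
# since the steps sum to z0 + (z1 - z0) = z1.
rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 16))  # noise initialization
z1 = rng.standard_normal((4, 16))  # target "data" latent
z_final = euler_sample(lambda z, t: z1 - z0, z0, n_steps=10)
print(np.allclose(z_final, z1))  # → True
```

In the full model the same integration loop serves all three modes: video generation denoises the video latent given clean text, understanding denoises the text latent given clean video, and joint generation denoises both in parallel with flows coupled through shared attention.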

