HyperAI

Uni-ViGU: Toward Unified Video Generation and Understanding via a Diffusion-Based Video Generator

Luozheng Qin, Jia Gong, Qian Qiao, Tianjiao Li, Li Xu, Haoyu Pan, Chao Qu, Zhiyu Tan, Hao Li

Abstract

Unified multimodal models that integrate visual understanding and generation face a fundamental challenge: visual generation incurs far higher computational costs than understanding, especially for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding using a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. In addition, we propose a modality-driven Mixture-of-Experts (MoE) framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generative knowledge for understanding, we design a two-stage bidirectional training mechanism: Knowledge Recall reconstructs the input prompts to exploit learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to build discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance in both video generation and video understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence.

Project page and code: https://fr0zencrane.github.io/uni-vigu-page/

One-sentence Summary

Uni-ViGU unifies video generation and understanding by extending a diffusion-based video generator through a unified flow matching method and a modality-driven MoE-based architecture, utilizing a two-stage bidirectional training mechanism to repurpose generative priors for discriminative understanding.

Key Contributions

  • The paper introduces Uni-ViGU, a framework that unifies video generation and understanding by extending a pretrained video generator as a foundation to leverage existing spatiotemporal priors.
  • A unified flow formulation is presented that enables coherent multimodal generation by performing continuous flow matching for video and discrete flow matching for text within a single process.
  • The work implements a modality-driven Mixture-of-Experts (MoE) architecture and a bidirectional training mechanism consisting of Knowledge Recall and Capability Refinement to repurpose generative knowledge for discriminative video understanding.

Introduction

Integrating visual understanding and generation into a single model is essential for developing general-purpose visual intelligence. Current approaches typically extend understanding-centric multimodal large language models to support generation, but this faces severe scalability issues because video generation requires processing millions of tokens through iterative denoising. The authors propose Uni-ViGU, a framework that inverts this paradigm by using a video generator as the foundational architecture. They introduce a unified flow method that combines continuous flow matching for video with discrete flow matching for text within a single process. To enable this, the authors leverage a modality-driven MoE-based architecture that augments Transformer blocks with lightweight layers for text while preserving generative priors, alongside a bidirectional training mechanism that repurposes learned text-to-video correspondences for video understanding.

Dataset

Dataset overview

The authors utilize a meticulously curated dataset of synthesized video-text pairs to train Uni-ViGU through a two-stage bidirectional framework. The dataset details are as follows:

  • Dataset Composition and Sources: The data is synthesized by using state-of-the-art video generators to create videos from a set of initial conditioning prompts. An LLM is then used to analyze each video-prompt pair to generate highly detailed captions that enrich the original prompt's information.
  • Subsets and Training Usage:
    • Stage 1 (Knowledge Recall): The model is trained on 10K video-prompt pairs. In this stage, the target text is identical to the conditioning prompt, though condition dropout is applied to prevent the model from simply copying the input.
    • Stage 2 (Capability Refinement): The model undergoes fine-tuning on an additional 10K video-prompt-detailed caption triples. Here, the model is conditioned on a brief prompt but tasked with generating a semantically precise, detailed caption.
  • Processing and Constraints: To ensure the model develops genuine comprehension rather than trivial inference, the authors enforce strict token-length constraints. Conditioning prompts are limited to 0 to 128 tokens, while detailed captions are restricted to 128 to 256 tokens. This length separation forces the model to rely on the shared attention mechanism to bridge the gap between the brief prompt and the rich description.
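The token-length constraints above can be sketched as a simple filtering step. The sketch below is illustrative only: whitespace tokenization stands in for the real tokenizer, and the exact boundary handling (half-open ranges) is an assumption, not something the paper specifies.

```python
def in_token_range(text, lo, hi, tokenize=str.split):
    """Check whether `text` falls in the half-open token range [lo, hi)."""
    n = len(tokenize(text))
    return lo <= n < hi

def split_by_length(pairs, prompt_max=128, caption_max=256):
    """Keep (prompt, detailed_caption) pairs that satisfy the paper's
    length separation: prompts under 128 tokens, captions in 128-256."""
    kept = []
    for prompt, caption in pairs:
        if (in_token_range(prompt, 0, prompt_max)
                and in_token_range(caption, prompt_max, caption_max)):
            kept.append((prompt, caption))
    return kept
```

The separation guarantees that a detailed caption always carries strictly more tokens than its conditioning prompt, so the model cannot satisfy the objective by copying.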

Method

The authors leverage the latent diffusion framework of WAN2.1, a state-of-the-art text-to-video generator, as the foundation for their unified model. This framework operates in a compressed latent space, enabling efficient video generation through iterative denoising. The process begins with a video $x$ being encoded into a latent representation $z_1 = \mathcal{E}(x)$ by a Variational Autoencoder (VAE). The model learns a diffusion process by defining a continuous transport path from Gaussian noise $z_0$ to the data latent $z_1$ via linear interpolation, $z_t = (1-t)z_0 + t z_1$. A neural network, specifically a Diffusion Transformer (DiT), is trained to predict the velocity field $u = z_1 - z_0$ conditioned on the text prompt $c$, the intermediate latent $z_t$, and the time step $t$, optimizing a flow matching loss. Inference proceeds by integrating this learned velocity field from $t=0$ to $t=1$ to generate the final latent, which is then decoded into the output video $\hat{x} = \mathcal{D}(z_1)$.
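A single flow-matching training step can be sketched in a few lines. This is a minimal NumPy illustration of the objective described above, not the authors' implementation; `model` is a placeholder for the DiT velocity predictor (conditioning on the prompt is omitted for brevity).

```python
import numpy as np

def flow_matching_loss(z1, model, rng):
    """One continuous flow-matching step: sample noise z0 and time t,
    interpolate z_t = (1 - t) * z0 + t * z1, and regress the model's
    prediction onto the target velocity u = z1 - z0."""
    z0 = rng.standard_normal(z1.shape)                      # Gaussian source
    t = rng.uniform(size=(z1.shape[0],) + (1,) * (z1.ndim - 1))
    zt = (1 - t) * z0 + t * z1                              # linear transport path
    u = z1 - z0                                             # target velocity field
    pred = model(zt, t)                                     # DiT stand-in
    return float(np.mean((pred - u) ** 2))                  # MSE flow-matching loss
```

At inference time the learned velocity is integrated from $t=0$ to $t=1$ rather than regressed, which is why the same network serves both training and sampling.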

WAN DiT Block and Uni-ViGU DiT Block

The core architecture of the video generator is a DiT, composed of multiple transformer blocks. Each block processes the input through a sequence of layers: self-attention, cross-attention, and a feed-forward network (FFN). The self-attention layer captures spatial and temporal dependencies within the video features, while the cross-attention layer integrates semantic information from the text prompt $c$, which is used as the key-value pair. The FFN layer performs position-wise transformations. This structure is extended to support a unified text-video generation framework, as shown in the figure below.
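The layer ordering described above can be sketched as follows. This is a single-head NumPy toy, assuming residual connections around each sublayer; normalization, multi-head projections, and time-step modulation from the real WAN DiT block are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention over (seq_len, dim) arrays."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

class DiTBlock:
    """Minimal sketch of the WAN-style block: self-attention over video
    tokens, cross-attention with text as key/value, position-wise FFN."""
    def __init__(self, dim, hidden, rng):
        self.w1 = rng.standard_normal((dim, hidden)) * 0.02
        self.w2 = rng.standard_normal((hidden, dim)) * 0.02

    def __call__(self, video, text):
        video = video + attention(video, video, video)  # self-attention
        video = video + attention(video, text, text)    # cross-attention (text as K/V)
        return video + np.maximum(video @ self.w1, 0) @ self.w2  # FFN
```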

Unified Flow Matching

To unify video and text generation, the authors propose a novel uni-flow process that models both modalities within a single generative framework. For video, the continuous flow matching formulation remains, operating in the latent space. For text, a discrete flow matching approach is adapted, where text tokens are mapped to continuous embeddings via a learnable matrix $E$. The model learns to predict the velocity field $u_t = z_{t,1} - z_{t,0}$ in this embedding space. Crucially, the two modalities are jointly learned in a single Transformer backbone. The key innovation lies in the modality-driven Mixture-of-Experts (MoE) architecture, which shares the attention layers to preserve cross-modal alignment while employing modality-specific FFN branches to capture domain-specific knowledge. The attention mechanism operates over the concatenated sequence of video and text tokens, enabling bidirectional cross-modal interaction. The resulting representations are then routed to modality-specific experts, $\mathrm{FFN}_v$ and $\mathrm{FFN}_t$, ensuring that the shared attention patterns learned during pretraining are fully utilized while the FFN layers can specialize for their respective modalities. This design allows for efficient knowledge transfer from the pretrained video generator to the text generation task.
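The routing logic of the modality-driven MoE can be sketched as below. This is an illustrative NumPy skeleton, not the authors' code: `attn`, `ffn_v`, and `ffn_t` are placeholder callables for the shared attention layer and the two modality experts.

```python
import numpy as np

def moe_block(video, text, attn, ffn_v, ffn_t):
    """Modality-driven MoE sketch: one shared attention pass over the
    concatenated video+text sequence, then route each modality's slice
    to its own FFN expert (FFN_v / FFN_t). All branches are residual."""
    seq = np.concatenate([video, text], axis=0)  # joint token sequence
    seq = seq + attn(seq)                        # shared bidirectional attention
    v, t = seq[:len(video)], seq[len(video):]    # split back by modality
    return v + ffn_v(v), t + ffn_t(t)            # modality-specific experts
```

Because routing is determined purely by modality rather than a learned gate, the pretrained attention weights are reused unchanged while only the text FFN branch is new.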

The training procedure consists of a two-stage bidirectional framework to effectively transfer and refine capabilities. The first stage, Knowledge Recall, initializes the model with a pretrained video generator and trains it to learn the reverse mapping from video to text. To prevent shortcut learning, the conditioning prompt is dropped with a certain probability, forcing the model to recover the text from the noisy video latent. The second stage, Capability Refinement, replaces the target text with detailed video captions, compelling the text generation branch to attend to the video latent to recover fine-grained visual details, thereby developing genuine video understanding. Inference is symmetric: for video generation, the model denoises the video latent from noise, guided by the text prompt; for video understanding, it denoises the text latent from noise, guided by the clean video. For joint generation, both modalities are initialized from noise and denoised in parallel, with their flows coupled through shared attention, allowing for mutual refinement.
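The symmetric inference described above reduces, in both directions, to integrating the learned velocity field from noise. A minimal NumPy Euler integrator sketch (the authors' actual sampler and step schedule are not specified here; `velocity` stands in for the conditioned DiT):

```python
import numpy as np

def euler_sample(velocity, shape, steps, rng):
    """Integrate the learned velocity field from t=0 to t=1 starting
    from Gaussian noise. The same loop serves video generation
    (text-guided velocity) and video understanding (video-guided
    velocity over text embeddings) by swapping the conditioning."""
    z = rng.standard_normal(shape)     # z_0 ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        z = z + dt * velocity(z, t)    # Euler step: z += dt * u(z, t)
    return z
```

For joint generation, both the video and text states would be advanced inside the same loop, with their velocities computed through the shared attention so each modality's partial result refines the other.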
