Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator
Luozheng Qin Jia Gong Qian Qiao Tianjiao Li Li Xu Haoyu Pan Chao Qu Zhiyu Tan Hao Li
Abstract
Unified multimodal models that integrate visual understanding and generation face a fundamental challenge: visual generation, and video generation in particular, costs far more compute than understanding. Motivated by this imbalance, this work inverts the conventional paradigm. Rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages. First, the Knowledge Recall stage reconstructs the input prompt to leverage learned text-video correspondences. Second, the Capability Refinement stage fine-tunes on detailed captions to establish discriminative shared representations. Experiments show that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence.
Project Page and Code: https://fr0zencrane.github.io/uni-vigu-page/
One-sentence Summary
Uni-ViGU unifies video generation and understanding by extending a diffusion-based video generator through a unified flow matching method and a modality-driven MoE-based architecture, utilizing a two-stage bidirectional training mechanism to repurpose generative priors for discriminative understanding.
Key Contributions
- The paper introduces Uni-ViGU, a framework that unifies video generation and understanding by using a pretrained video generator as the foundation, so that its existing spatiotemporal priors can be reused.
- A unified flow formulation is presented that enables coherent multimodal generation by performing continuous flow matching for video and discrete flow matching for text within a single process.
- The work implements a modality-driven Mixture-of-Experts (MoE) architecture and a bidirectional training mechanism consisting of Knowledge Recall and Capability Refinement to repurpose generative knowledge for discriminative video understanding.
Introduction
Integrating visual understanding and generation into a single model is essential for developing general-purpose visual intelligence. Current approaches typically extend understanding-centric multimodal large language models to support generation, but this scales poorly because video generation requires processing millions of tokens through iterative denoising. The authors propose Uni-ViGU, a framework that inverts this paradigm by using a video generator as the foundational architecture. They introduce a unified flow method that combines continuous flow matching for video with discrete flow matching for text within a single process. To enable this, the authors use a modality-driven MoE-based architecture that augments Transformer blocks with lightweight layers for text while preserving generative priors, alongside a bidirectional training mechanism that repurposes learned text-to-video correspondences for video understanding.
Dataset

The authors utilize a meticulously curated dataset of synthesized video-text pairs to train Uni-ViGU through a two-stage bidirectional framework. The dataset details are as follows:
- Dataset Composition and Sources: The data is synthesized by using state-of-the-art video generators to create videos from a set of initial conditioning prompts. An LLM is then used to analyze each video-prompt pair to generate highly detailed captions that enrich the original prompt's information.
- Subsets and Training Usage:
  - Stage 1 (Knowledge Recall): The model is trained on 10K video-prompt pairs. In this stage, the target text is identical to the conditioning prompt, though condition dropout is applied to prevent the model from simply copying the input.
  - Stage 2 (Capability Refinement): The model undergoes fine-tuning on an additional 10K video-prompt-detailed caption triples. Here, the model is conditioned on a brief prompt but tasked with generating a semantically precise, detailed caption.
- Processing and Constraints: To ensure the model develops genuine comprehension rather than trivial inference, the authors enforce strict token-length constraints. Conditioning prompts are limited to 0 to 128 tokens, while detailed captions are restricted to 128 to 256 tokens. This length separation forces the model to rely on the shared attention mechanism to bridge the gap between the brief prompt and the rich description.
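The length windows described above can be sketched as a simple filter. This is only an illustration under stated assumptions: the source does not name the tokenizer or say whether the bounds are inclusive, so whitespace splitting and inclusive bounds are used as stand-ins, and the function names are hypothetical.

```python
def count_tokens(text):
    # Assumption: whitespace splitting stands in for the real tokenizer,
    # which the source does not name.
    return len(text.split())


def keep_triple(prompt, caption, prompt_max=128, caption_min=128, caption_max=256):
    """Return True when both texts fall inside their length windows.

    Inclusive bounds are an assumption; the source only gives the ranges
    0 to 128 (prompt) and 128 to 256 (detailed caption).
    """
    n_prompt = count_tokens(prompt)
    n_caption = count_tokens(caption)
    return n_prompt <= prompt_max and caption_min <= n_caption <= caption_max


short_prompt = "a red kite drifting over a beach"
long_caption = " ".join(["detail"] * 200)      # 200 placeholder tokens
too_short_caption = " ".join(["detail"] * 50)  # 50 placeholder tokens

print(keep_triple(short_prompt, long_caption))
print(keep_triple(short_prompt, too_short_caption))
```

A real pipeline would count tokens with the model's own tokenizer, since whitespace counts can differ substantially from subword counts.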
Method
The authors leverage the latent diffusion framework of WAN2.1, a state-of-the-art text-to-video generator, as the foundation for their unified model. This framework operates in a compressed latent space, enabling efficient video generation through iterative denoising. The process begins with a video $x$ being encoded into a latent representation $z_1 = E(x)$ by a Variational Autoencoder (VAE). The model learns a diffusion process by defining a continuous transport path from Gaussian noise $z_0$ to the data latent $z_1$ via linear interpolation, $z_t = (1 - t)\,z_0 + t\,z_1$. A neural network, specifically a Diffusion Transformer (DiT), is trained to predict the velocity field $u = z_1 - z_0$ conditioned on the text prompt $c$, the intermediate latent $z_t$, and the time step $t$, optimizing a flow matching loss. Inference proceeds by integrating this learned velocity field from $t = 0$ to $t = 1$ to generate the final latent, which is then decoded into the output video $\hat{x} = D(z_1)$.
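The interpolation path, velocity target, and integration step described above can be sketched in a few lines of plain Python. This is a toy illustration, not the WAN2.1 implementation: the latent is a three-element list, the mean-squared-error form of the loss is an assumption (the source only says "flow matching loss"), fixed-step Euler is an assumed integrator, and an oracle function stands in for the trained DiT.

```python
import random


def interpolate(z0, z1, t):
    # Linear path between noise and data: z_t = (1 - t) * z0 + t * z1
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]


def target_velocity(z0, z1):
    # u = z1 - z0, constant along the straight path
    return [b - a for a, b in zip(z0, z1)]


def flow_matching_loss(predicted, z0, z1):
    # Assumed form: mean squared error between predicted and target velocity
    u = target_velocity(z0, z1)
    return sum((p - q) ** 2 for p, q in zip(predicted, u)) / len(u)


def euler_sample(velocity_fn, z0, steps=10):
    # Integrate dz/dt = v(z, t) from t = 0 to t = 1 with fixed-step Euler
    z = list(z0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(z, t)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


rng = random.Random(0)
z1 = [0.5, -1.0, 2.0]                   # stand-in for a VAE latent E(x)
z0 = [rng.gauss(0.0, 1.0) for _ in z1]  # Gaussian noise sample


def oracle(z, t):
    # Hypothetical perfect model: returns the true velocity for this pair
    return target_velocity(z0, z1)


sample = euler_sample(oracle, z0, steps=10)
print(max(abs(a - b) for a, b in zip(sample, z1)))  # close to zero
```

Because the target velocity is constant along a straight path, the oracle recovers the data latent up to floating-point error; a trained network only approximates this, which is why the number of integration steps matters in practice.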

The core architecture of the video generator is a DiT, composed of multiple transformer blocks. Each block processes the input through a sequence of layers: self-attention, cross-attention, and a feed-forward network (FFN). The self-attention layer captures spatial and temporal dependencies within the video features, while the cross-attention layer integrates semantic information from the text prompt c, which supplies the keys and values. The FFN layer performs position-wise transformations. This structure is extended to support a unified text-video generation framework, as shown in the figure below.

To unify video and text generation, the authors propose a novel uni-flow process that models both modalities within a single generative framework. For video, the continuous flow matching formulation remains, operating in the latent space. For text, a discrete flow matching approach is adapted, where text tokens are mapped to continuous embeddings via a learnable matrix $E$. The model learns to predict the velocity field $u_t = z_{t,1} - z_{t,0}$ in this embedding space. Crucially, the two modalities are jointly learned in a single Transformer backbone. The key innovation lies in the modality-driven Mixture-of-Experts (MoE) architecture, which shares the attention layers to preserve cross-modal alignment while employing modality-specific FFN branches to capture domain-specific knowledge. The attention mechanism operates over the concatenated sequence of video and text tokens, enabling bidirectional cross-modal interaction. The resulting representations are then routed to modality-specific experts, $\mathrm{FFN}_v$ and $\mathrm{FFN}_t$, ensuring that the shared attention patterns learned during pretraining are fully utilized while the FFN layers can specialize for their respective modalities. This design allows for efficient knowledge transfer from the pretrained video generator to the text generation task.
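The routing pattern described above, shared attention over the concatenated sequence followed by a modality-specific FFN, can be sketched as follows. This is a heavily simplified illustration and not the authors' architecture: the single attention head, identity query/key/value projections, ReLU activation, residual connections, and the omission of normalization and cross-attention are all assumptions, and every function name is hypothetical.

```python
import math


def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shared_attention(tokens):
    # Single head with identity projections (assumption). Every token attends
    # to every other token, so video and text positions exchange information.
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def ffn(w_in, w_out, x):
    hidden = [max(0.0, h) for h in matvec(w_in, x)]  # ReLU (assumption)
    return matvec(w_out, hidden)


def moe_block(video_tokens, text_tokens, ffn_v, ffn_t):
    tokens = video_tokens + text_tokens  # concatenate the two sequences
    attended = shared_attention(tokens)
    hidden = [[a + b for a, b in zip(x, y)] for x, y in zip(tokens, attended)]
    n_video = len(video_tokens)
    out = []
    for i, h in enumerate(hidden):
        # Route by modality (token position in the concatenation), not by a learned gate
        w_in, w_out = ffn_v if i < n_video else ffn_t
        out.append([a + b for a, b in zip(h, ffn(w_in, w_out, h))])
    return out[:n_video], out[n_video:]


video = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.5, 0.5]]
# Toy expert weights: (w_in with 3 hidden units, w_out back to 2 dims)
ffn_v = ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0]])
ffn_t = ([[0.0, 0.0]] * 3, [[0.0] * 3] * 2)  # zero expert, passes input through

video_out, text_out = moe_block(video, text, ffn_v, ffn_t)
print(len(video_out), len(text_out))
```

The point of the sketch is the split of responsibilities: the attention weights are computed once over both modalities, while each token's FFN is chosen purely by which modality it belongs to.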
The training procedure consists of a two-stage bidirectional framework to effectively transfer and refine capabilities. The first stage, Knowledge Recall, initializes the model with a pretrained video generator and trains it to learn the reverse mapping from video to text. To prevent shortcut learning, the conditioning prompt is dropped with a certain probability, forcing the model to recover the text from the noisy video latent. The second stage, Capability Refinement, replaces the target text with detailed video captions, compelling the text generation branch to attend to the video latent to recover fine-grained visual details, thereby developing genuine video understanding. Inference is symmetric: for video generation, the model denoises the video latent from noise, guided by the text prompt; for video understanding, it denoises the text latent from noise, guided by the clean video. For joint generation, both modalities are initialized from noise and denoised in parallel, with their flows coupled through shared attention, allowing for mutual refinement.
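How the two stages differ in their conditioning and target text can be sketched with a small helper. This is an illustration under assumptions: the function name is hypothetical, the empty string stands in for whatever null condition the real model uses, and the dropout probability of 0.5 is a placeholder because the source only says the prompt is dropped "with a certain probability". The source mentions condition dropout only for Stage 1, so the Stage 2 call below passes a probability of zero.

```python
import random


def build_training_example(prompt, detailed_caption, stage, drop_prob, rng):
    """Return (condition, target_text) for one training sample."""
    if stage == 1:
        target = prompt            # Knowledge Recall: reconstruct the prompt
    elif stage == 2:
        target = detailed_caption  # Capability Refinement: generate the rich caption
    else:
        raise ValueError("stage must be 1 or 2")
    # Condition dropout: with probability drop_prob the prompt is withheld, so
    # the text branch must rely on the video latent instead of copying the input.
    # The empty string is an assumed placeholder for the null condition.
    condition = "" if rng.random() < drop_prob else prompt
    return condition, target


rng = random.Random(0)
prompt = "a dog runs on grass"
caption = "a brown dog sprints across a sunlit lawn, ears back, toward the camera"

# Stage 1 with a hypothetical dropout rate of 0.5
stage1 = [build_training_example(prompt, caption, 1, 0.5, rng) for _ in range(4)]
# Stage 2 keeps the brief prompt as the condition
stage2 = build_training_example(prompt, caption, 2, 0.0, rng)

for condition, target in stage1:
    print(repr(condition), "->", repr(target))
print(repr(stage2[0]), "->", repr(stage2[1]))
```

In Stage 1 the target always equals the prompt, which is why dropout is needed to rule out copying; in Stage 2 the target carries details absent from the prompt, so the model has to read them from the video.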