SemanticGen: Video Generation in Semantic Space
Abstract
State-of-the-art video generation models typically learn the distribution of video latents in the VAE space and map them to pixels with a VAE decoder. While this approach produces high-quality videos, it suffers from slow convergence and high computational cost when generating long videos. In this paper, we introduce SemanticGen, a novel solution that overcomes these limitations by generating videos directly in semantic space. Our key insight is that, because videos are inherently redundant, the generation process should start in a compact, high-level semantic space for global planning and then progressively add high-frequency details, rather than directly modeling a large set of low-level video tokens with bidirectional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in semantic space converges faster than generation in the VAE latent space. Our method is also efficient and cost-effective when scaled to long video generation. Extensive experiments demonstrate that SemanticGen produces high-quality videos and outperforms state-of-the-art approaches as well as strong baselines.
One-sentence Summary
Researchers from Zhejiang University, Kuaishou Technology (Kling Team), and CUHK propose SemanticGen, a two-stage video generation framework that operates in semantic space rather than VAE latents: it first creates compact semantic features for global video planning using diffusion, then generates VAE latents conditioned on these features, accelerating convergence and enabling efficient, high-quality one-minute video synthesis while outperforming prior VAE-based methods.
Key Contributions
- Current video generative models suffer from slow convergence and high computational costs for long videos due to modeling low-level video tokens with bidirectional attention in the VAE latent space, which becomes impractical for extended sequences.
- SemanticGen introduces a two-stage diffusion framework that first generates compact semantic features for global video layout planning, then conditions VAE latent generation on these features to add high-frequency details, leveraging video redundancy for efficiency.
- Experiments show this approach achieves faster convergence than VAE latent space generation and outperforms state-of-the-art methods in video quality while scaling effectively to long video generation scenarios.
Introduction
Video generation models face a critical trade-off: diffusion-based methods produce high-fidelity short videos but struggle with long sequences due to the computational limits of processing all frames simultaneously via bidirectional attention, while autoregressive approaches scale better for long videos yet deliver inferior quality. Hybrid methods attempt to bridge this gap but still underperform pure diffusion models. Separately, research shows that integrating semantic representations—like those from pre-trained encoders—accelerates training and boosts image generation quality compared to raw latent spaces. The authors address both challenges by proposing a novel diffusion-based video generation framework. They fine-tune diffusion models to learn compressed semantic representations, which are then mapped to the VAE latent space. This approach achieves faster convergence than direct VAE latent generation and effectively scales to long video synthesis, overcoming prior scalability barriers without sacrificing fidelity.
Dataset
The authors use two distinct datasets for short and long video generation:
- Dataset composition and sources: the training data comprises an internal text-video pair dataset for short videos and a constructed long-video dataset sourced from movie and TV show clips.
- Key subset details:
  - Short video subset: uses internal text-video pairs (exact size unspecified).
  - Long video subset: created by splitting movie/TV clips into 60-second segments, with text prompts auto-generated by an internal captioner. No explicit filtering rules are detailed.
- Data usage in training:
  - Frames are sampled at 24 fps for VAE input and 1.6 fps for the semantic encoder during training.
  - The semantic encoder leverages the vision tower from Qwen2.5-VL-72B-Instruct.
  - Training focuses on text-to-video generation without specified mixture ratios or split proportions.
- Processing specifics:
  - No cropping strategy is mentioned.
  - Metadata (text prompts) for long videos is synthetically constructed via the internal captioner.
  - Frame-rate differentiation (24 fps vs. 1.6 fps) tailors inputs to the VAE and semantic-encoder workflows; a sampling sketch follows this list.
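The following is a minimal sketch of the dual-rate frame sampling described above. The helper name and rounding scheme are assumptions, since the preprocessing code is not described in detail; only the 24 fps / 1.6 fps rates come from the paper summary.

```python
import numpy as np

def sample_frame_indices(num_frames: int, src_fps: float, target_fps: float) -> np.ndarray:
    """Indices of frames drawn at target_fps from a clip stored at src_fps."""
    step = src_fps / target_fps            # e.g. 24 / 1.6 = 15 -> every 15th frame
    idx = np.arange(0, num_frames, step)
    return np.round(idx).astype(int)

# Example: a 60-second clip stored at 24 fps (1440 frames).
vae_idx = sample_frame_indices(1440, src_fps=24, target_fps=24)    # all 1440 frames for the VAE
sem_idx = sample_frame_indices(1440, src_fps=24, target_fps=1.6)   # 96 frames for the semantic encoder
```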
Method
The authors leverage a two-stage diffusion-based framework called SemanticGen to generate videos by first modeling high-level semantic representations and then refining them into VAE latents. This design addresses the computational inefficiency and slow convergence of conventional methods that operate directly in the VAE latent space. The overall architecture is structured to decouple global scene planning from fine-grained detail synthesis, enabling scalable and efficient generation, particularly for long videos.
In the first stage, a semantic generator is trained to produce compact semantic representations conditioned on text prompts. These representations are derived from an off-the-shelf video semantic encoder — specifically, the vision tower of Qwen2.5-VL — which processes input videos by downsampling frames and compressing spatial patches into a low-dimensional token sequence. To improve training stability and convergence, the authors introduce a lightweight MLP that compresses the high-dimensional semantic features zsem′ into a lower-dimensional latent space zsem, parameterized as a Gaussian distribution. The MLP outputs mean and variance vectors, and a KL divergence term is added to the loss to regularize the compressed space. This compressed representation is then used as conditioning for the second stage.
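A minimal sketch of this compression step is shown below. The two-layer MLP, hidden width, and KL weighting are assumptions; the 2048-to-8 dimensionality follows the ablation reported in the experiments.

```python
import torch
import torch.nn as nn

# Hedged sketch: map high-dimensional encoder features zsem' to the mean and
# log-variance of a low-dimensional Gaussian zsem, with a KL term against N(0, I).
class SemanticCompressor(nn.Module):
    def __init__(self, in_dim: int = 2048, latent_dim: int = 8, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, 2 * latent_dim),          # mean and log-variance
        )

    def forward(self, z_sem_prime: torch.Tensor):
        mu, logvar = self.net(z_sem_prime).chunk(2, dim=-1)
        z_sem = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).mean()     # KL(q || N(0, I))
        return z_sem, kl

# z_sem_prime: (batch, tokens, 2048) features from the video semantic encoder.
compressor = SemanticCompressor()
z_sem, kl_loss = compressor(torch.randn(2, 96, 2048))
```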
Refer to the framework diagram for the semantic generation pipeline. The semantic generator takes a text prompt as input and outputs zsem, the compressed semantic embedding defined by applying the MLP to the encoder features zsem′, which is then used for conditioning the second stage. This design ensures that the semantic space is both compact and amenable to diffusion sampling.

In the second stage, a video diffusion model generates VAE latents conditioned on the compressed semantic representation. During training, the model receives concatenated inputs: the noised VAE latent zt and the semantic embedding zsem. This in-context conditioning allows the diffusion model to leverage high-level semantic cues — such as object motion and scene layout — while generating low-level visual details. The authors observe that this conditioning preserves global structure and dynamics while allowing variation in texture and color, as demonstrated in reference-based generation experiments.
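A shape-level sketch of this in-context conditioning follows; the tensor sizes are illustrative assumptions, and the `dit` call is a hypothetical stand-in for the stage-two diffusion transformer.

```python
import torch

# Concatenate the compressed semantic tokens zsem with the noised VAE latent
# tokens z_t along the sequence axis; the diffusion model attends over both,
# but only the VAE-latent positions are supervised.
B, N_vae, N_sem, D = 2, 4096, 256, 1024
z_t   = torch.randn(B, N_vae, D)           # noised VAE latent tokens
z_sem = torch.randn(B, N_sem, D)           # semantic tokens, projected to the model width

tokens = torch.cat([z_sem, z_t], dim=1)    # in-context sequence seen by the model
# velocity = dit(tokens, t, text_emb)      # hypothetical model call
# loss is computed on velocity[:, N_sem:]  # i.e. only on the VAE-latent positions
```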
As shown in the figure below, the VAE latent generator integrates the semantic encoder and MLP to extract and compress semantic features from input videos, which are then injected into the diffusion model alongside the noised VAE latents. During inference, the semantic generator produces zsem from text, and noise is added to it to bridge the training-inference gap, following practices from RAE.

For long video generation, SemanticGen employs a hybrid attention strategy to manage computational complexity. In the semantic space, full self-attention is applied to maintain global consistency across scenes and characters. When mapping to the VAE latent space, shifted-window attention is used to limit quadratic growth in compute cost. The authors interleave VAE latent tokens and semantic tokens within attention windows of size Tw, shifting the window by Tw/2 at odd layers. This design ensures that the semantic space — which is 16x more compressed than the VAE space — remains tractable while enabling high-fidelity reconstruction.
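A rough sketch of the windowed part of this hybrid attention is given below. The roll-based shifting and the (B, T, N, D) token layout are assumptions about the implementation, and the semantic tokens (which keep full self-attention) are omitted for brevity.

```python
import torch

def window_partition(x: torch.Tensor, t_w: int, layer_idx: int) -> torch.Tensor:
    """x: (B, T, N, D) VAE-latent tokens, T frames of N spatial tokens each.

    Groups frames into temporal windows of size t_w; on odd layers the windows
    are shifted by t_w // 2 so information can flow across window boundaries.
    Tokens inside each window attend to each other with full attention.
    """
    B, T, N, D = x.shape
    if layer_idx % 2 == 1:                      # shifted windows on odd layers
        x = torch.roll(x, shifts=-t_w // 2, dims=1)
    x = x.view(B, T // t_w, t_w, N, D)          # (B, num_windows, t_w, N, D)
    return x.flatten(2, 3)                      # (B, num_windows, t_w * N, D)

windows = window_partition(torch.randn(1, 32, 128, 64), t_w=8, layer_idx=1)
print(windows.shape)                            # torch.Size([1, 4, 1024, 64])
```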
The inference pipeline integrates both stages: the semantic generator produces zsem from text, which is then fed into the VAE latent generator to produce the final video. The entire process is illustrated in the overall framework diagram, which highlights the sequential nature of semantic-to-latent generation.
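A high-level sketch of this two-stage inference loop is shown below; the module names and their `.sample` interfaces are placeholders, and `sigma_sem` is an assumed hyperparameter for the RAE-style noise augmentation mentioned above.

```python
import torch

@torch.no_grad()
def generate_video(prompt: str, semantic_generator, latent_generator, vae_decoder,
                   sigma_sem: float = 0.1):
    # Stage 1: sample compact semantic tokens that plan the global video layout.
    z_sem = semantic_generator.sample(prompt)

    # Bridge the training-inference gap by lightly perturbing the generated semantics.
    z_sem = z_sem + sigma_sem * torch.randn_like(z_sem)

    # Stage 2: sample VAE latents conditioned (in context) on z_sem, then decode to pixels.
    z_vae = latent_generator.sample(prompt, condition=z_sem)
    return vae_decoder(z_vae)
```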

The base diffusion model follows the Rectified Flow framework, with the forward process defined as z_t = (1 − t)·z_0 + t·ε, where ε ∼ N(0, I). Training minimizes the Conditional Flow Matching loss L_CFM = E_{t, p_t(z | ε), p(ε)} ‖v_Θ(z_t, t) − u_t(z_t | ε)‖²₂, and inference uses Euler discretization: z_t = z_{t−1} + v_Θ(z_{t−1}, t)·Δt. This foundation enables efficient sampling while preserving temporal coherence.
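A minimal sketch of this objective and sampler follows, assuming the standard Rectified Flow velocity target ε − z_0 and a placeholder model(z, t) network; step count and integration direction are conventional choices, not specifics from the paper.

```python
import torch

def cfm_loss(model, z0: torch.Tensor) -> torch.Tensor:
    """Conditional Flow Matching loss for the interpolation z_t = (1 - t) z_0 + t eps."""
    t = torch.rand(z0.shape[0], *([1] * (z0.dim() - 1)), device=z0.device)
    eps = torch.randn_like(z0)
    z_t = (1 - t) * z0 + t * eps                   # forward (noising) process
    target = eps - z0                              # conditional velocity u_t
    return (model(z_t, t) - target).pow(2).mean()

@torch.no_grad()
def euler_sample(model, shape, steps: int = 50, device="cpu"):
    """Euler integration of the learned velocity field from noise (t = 1) to data (t = 0)."""
    z = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), 1.0 - i * dt, device=device)
        z = z - model(z, t) * dt                   # step toward t = 0
    return z
```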

The authors emphasize that SemanticGen is not tied to a specific semantic encoder; alternatives such as V-JEPA 2, VideoMAE 2, or 4DS can be substituted. The framework’s modularity and reliance on pre-trained semantic encoders make it adaptable to evolving video understanding models while maintaining efficiency and quality in generation.
Experiment
- Validated semantic space compression: Using an 8-dimensional representation achieved the highest VBench scores, reducing artifacts and broken frames compared to higher-dimensional variants (64-dim and uncompressed).
- Confirmed faster convergence in semantic space: Training on pre-trained semantic representations converged significantly faster than modeling compressed VAE latents, producing coherent videos within the same training steps where VAE-based approaches yielded only coarse color patches.
- Outperformed state-of-the-art in long video generation: Surpassed all baselines (including MAGI-1, SkyReels-V2, and continued-training variants) in temporal consistency and reduced drifting issues, achieving lower Δ_drift^M scores and maintaining stable backgrounds without color shifts.
- Matched SOTA in short video quality: Achieved comparable VBench metrics to leading methods while demonstrating superior text adherence (e.g., correctly generating specific motions like head turns or melting snowflakes missed by baselines).
The authors evaluate the impact of semantic space compression on video generation quality by comparing models trained with different output dimensions. Results show that compressing the semantic representation to lower dimensions (dim=8) improves consistency, smoothness, and aesthetic quality compared to no compression (dim=2048), with the dim=8 variant achieving the highest scores across most metrics. This indicates that dimensionality reduction enhances both convergence and output fidelity in the semantic generation pipeline.

The authors evaluate SemanticGen against baselines using quantitative metrics for long video generation, showing it achieves the highest subject consistency and motion smoothness while maintaining strong background consistency and minimal temporal flickering. SemanticGen outperforms Hunyuan-Video and Wan2.1-T2V-14B in subject consistency and matches or exceeds them in other stability metrics, though it scores slightly lower than Wan2.1-T2V-14B in imaging and aesthetic quality. Results confirm SemanticGen’s effectiveness in preserving long-term coherence without significant degradation in visual fidelity.

SemanticGen outperforms baseline methods across multiple video quality metrics, achieving the highest scores in subject consistency, background consistency, motion smoothness, and imaging quality. It also demonstrates the lowest drifting metric (Δ_drift^M), indicating superior long-term temporal stability compared to alternatives like SkyReels-V2, Self-Forcing, LongLive, and Base-CT-Swin.
