Pyramidal Flow Matching for Efficient Video Generative Modeling
Abstract
Video generation requires modeling a vast spatiotemporal space, which demands significant computational resources and data usage. To reduce the complexity, the prevailing approaches employ a cascaded architecture to avoid directly training on full-resolution latents. Despite reducing computational demands, the separate optimization of each sub-stage hinders knowledge sharing and sacrifices flexibility. This work introduces a unified pyramidal flow matching algorithm. It reinterprets the original denoising trajectory as a series of pyramid stages, where only the final stage operates at the full resolution, thereby enabling more efficient video generative modeling. Through careful design, the flows of different pyramid stages can be interlinked to maintain continuity. Moreover, we craft autoregressive video generation with a temporal pyramid to compress the full-resolution history. The entire framework can be optimized in an end-to-end manner with a single unified Diffusion Transformer (DiT). Extensive experiments demonstrate that our method supports generating high-quality 5-second (up to 10-second) videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours. All code and models are open-sourced at https://pyramid-flow.github.io.
One-sentence Summary
The authors from Peking University, Kuaishou Technology, and Beijing University of Posts and Telecommunications propose a unified pyramidal flow matching framework that reinterprets video generation as a multi-stage process, enabling end-to-end training with a single Diffusion Transformer while reducing computational cost through hierarchical flow interlinking and autoregressive temporal pyramid compression, achieving high-quality 768p 24 FPS video generation up to 10 seconds within 20.7k A100 GPU hours.
Key Contributions
- Existing video generation methods face high computational costs due to modeling large spatiotemporal spaces at full resolution, often relying on cascaded architectures that separately optimize multiple stages, leading to inefficiency and limited knowledge sharing.
- This work introduces a unified pyramidal flow matching framework that reinterprets the generation process as a sequence of spatial and temporal pyramid stages, where only the final stage operates at full resolution, enabling efficient joint optimization through a single Diffusion Transformer.
- The method achieves state-of-the-art video generation quality on 768p at 24 FPS for up to 10-second videos, reducing training tokens by over 85% compared to full-sequence diffusion and demonstrating strong performance on VBench and EvalCrafter benchmarks.
Introduction
Video generation using diffusion and autoregressive models has achieved impressive results in realism and duration, but remains computationally expensive due to the need to model large spatiotemporal data at full resolution. Prior approaches often rely on cascaded architectures that generate video in stages—first at low resolution and then upsampled with separate super-resolution models—reducing per-stage computation but introducing inefficiencies through disjoint training, limited knowledge sharing, and complex multi-model pipelines. The authors introduce pyramidal flow matching, a unified framework that leverages both spatial and temporal pyramids to compress video representations across scales. By reinterpreting the generation process as a sequence of interconnected stages operating at progressively finer resolutions, the method enables end-to-end training within a single Diffusion Transformer, eliminating the need for separate models. This design allows simultaneous generation and decompression while drastically reducing the number of tokens during training—cutting computational load by up to 87% for 10-second videos—without sacrificing quality, as demonstrated on benchmark datasets.
Method
The authors leverage a unified pyramidal flow matching framework to address the computational challenges of video generation by reinterpreting the denoising trajectory as a series of spatial pyramid stages. This approach avoids the need for separate models at each resolution, enabling knowledge sharing and a more efficient training process. The core of the method involves a spatial pyramid where the generation process is decomposed into multiple stages, each operating at a progressively higher resolution. The initial stages begin from a noisy, pixelated latent at a low resolution, and each subsequent stage refines the output, culminating in a full-resolution result at the final stage. This design significantly reduces computational cost, as most stages operate at lower resolutions, with only the final stage requiring full-resolution computation.
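To make the stage layout concrete, here is a minimal sketch (our own convention, not the authors' code) of how per-stage resolutions could be scheduled: each stage doubles the previous one, and only the final stage runs at the full target resolution.

```python
def stage_resolutions(full_hw, num_stages=3):
    """Per-stage latent resolutions for a spatial pyramid (illustrative).

    Each stage doubles the resolution of the previous one; only the
    final stage runs at the full target resolution.
    """
    h, w = full_hw
    return [(h >> (num_stages - 1 - k), w >> (num_stages - 1 - k))
            for k in range(num_stages)]

# For a 768x1280 latent and 3 stages:
# stage_resolutions((768, 1280)) -> [(192, 320), (384, 640), (768, 1280)]
```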

The unified modeling of the pyramidal flow is achieved by defining a conditional probability path that interpolates between different noise levels and resolutions. This path is constructed by sampling endpoints at each stage, where the starting point is derived from a lower-resolution, upsampled version of the clean data, and the ending point is a noisy version of the higher-resolution data. To ensure the flow trajectory is straight and the probability path is continuous, the authors enforce the noise at the endpoints to be in the same direction, which is achieved by sampling a single noise vector and using it to compute both endpoints. The flow model is then trained to regress the velocity field on this conditional vector field, which is the difference between the endpoints. This unified objective allows a single model to handle both the initial generation and the subsequent decompression or super-resolution steps, eliminating the need for separate models and facilitating knowledge sharing across stages.
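This paragraph maps naturally onto a rectified-flow-style training step. The following is a minimal, self-contained sketch under our own assumptions: `s` and `e` are the stage's start and end noise levels (with `s > e`), `model` is a generic DiT-style callable, latents are per-frame 4D tensors with dimensions divisible by `down`, and the default values are placeholders rather than the paper's derived stage schedule.

```python
import torch
import torch.nn.functional as F

def stage_endpoints(x1, eps, s, e, down):
    """Endpoints of one pyramid stage (illustrative).

    x1:   clean latent at this stage's resolution, shape (N, C, H, W)
    eps:  a single shared noise sample with the same shape
    s, e: noise levels at the stage start and end, with s > e
    down: downsampling factor relative to this stage's resolution
    """
    # Start point: a noisy version of a pixelated (down- then up-sampled)
    # latent, mimicking what the previous, lower-resolution stage hands over.
    low = F.interpolate(x1, scale_factor=1.0 / down, mode="bilinear")
    pixelated = F.interpolate(low, scale_factor=float(down), mode="nearest")
    x_s = (1.0 - s) * pixelated + s * eps
    # End point: a less noisy version of the detailed latent at this resolution.
    x_e = (1.0 - e) * x1 + e * eps  # same eps keeps the conditional flow straight
    return x_s, x_e

def pyramidal_fm_loss(model, x1, cond, s=0.7, e=0.3, down=2):
    eps = torch.randn_like(x1)  # one noise vector shared by both endpoints
    x_s, x_e = stage_endpoints(x1, eps, s, e, down)
    t = torch.rand(x1.size(0), 1, 1, 1, device=x1.device)
    x_t = (1.0 - t) * x_s + t * x_e   # point on the straight in-stage path
    v_target = x_e - x_s              # conditional velocity (endpoint difference)
    v_pred = model(x_t, t.flatten(), cond)
    return F.mse_loss(v_pred, v_target)
```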

During inference, the method employs a renoising scheme to handle the jump points between successive pyramid stages of different resolutions, ensuring the continuity of the probability path. The process begins by upsampling the previous stage's low-resolution endpoint, whose distribution is Gaussian but does not match the start distribution of the next stage. To reconcile the two, a corrective noise is added: the renoising rule rescales the upsampled result and adds a small amount of noise with a specific covariance structure, matching both the mean and the covariance of the target distribution. Because upsampling duplicates pixels within each block, the corrective noise also serves to decorrelate them, which is crucial for preserving the signal at each jump point.
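A sketch of one way to realize the described renoising, assuming nearest-neighbor 2x upsampling; `alpha` and `sigma` are stand-ins for the coefficients the paper derives by matching means and covariances, not the authors' exact values.

```python
import torch
import torch.nn.functional as F

def renoise_jump(x_end_prev, alpha, sigma):
    """Renoising at a jump point between pyramid stages (illustrative).

    x_end_prev:   previous stage's endpoint at its (lower) resolution
    alpha, sigma: placeholder coefficients; the paper derives them by
                  matching the target distribution's mean and covariance
    """
    # Nearest-neighbor upsampling duplicates each latent pixel into a 2x2
    # block, leaving those pixels perfectly correlated.
    up = F.interpolate(x_end_prev, scale_factor=2.0, mode="nearest")
    # Corrective noise with zero mean inside each 2x2 block: it pushes the
    # duplicated pixels apart (decorrelation) without shifting block averages.
    eps = torch.randn_like(up)
    block_mean = F.avg_pool2d(eps, kernel_size=2)
    eps = eps - F.interpolate(block_mean, scale_factor=2.0, mode="nearest")
    return alpha * up + sigma * eps
```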

To further improve training efficiency, the authors introduce a temporal pyramid design for autoregressive video generation. This design reduces the computational redundancy in the full-resolution history condition by using compressed, lower-resolution history for each prediction. At each pyramid stage, the generation is conditioned on a history of past frames that are progressively downsampled in resolution. This significantly reduces the number of training tokens, as most frames are computed at the lowest resolution, thereby improving training efficiency. The position encoding is designed to be compatible with the pyramid structure, extrapolating in the spatial pyramid to preserve fine-grained details and interpolating in the temporal pyramid to ensure spatial alignment of the history conditions.
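The history compression can be illustrated with a short sketch. This is our own simplification, not the authors' code: the most recent frames are kept near full resolution, while older frames are downsampled more aggressively, so most conditioning tokens come from the cheapest scale.

```python
import torch
import torch.nn.functional as F

def compressed_history(frames, num_stages=3):
    """Temporal pyramid over the history condition (illustrative).

    frames: list of latent frames, oldest first, each of shape (C, H, W)
    Returns the frames in the same order, with older frames downsampled
    to progressively lower resolutions.
    """
    history = []
    for i, f in enumerate(reversed(frames)):  # iterate newest frame first
        # Newest frames stay at full resolution; older ones are compressed more.
        level = min(i, num_stages - 1)
        factor = 2 ** level
        f_low = F.interpolate(f.unsqueeze(0), scale_factor=1.0 / factor,
                              mode="bilinear")
        history.append(f_low.squeeze(0))
    return list(reversed(history))  # restore temporal order, oldest first
```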

Experiment
- Proposed pyramidal flow matching framework achieves significant efficiency gains: by keeping most stages at low resolution, it cuts training computation and token count by large factors (over 85% fewer tokens than full-sequence diffusion), enabling 10-second video generation with only 20.7k A100 GPU hours, less than half the compute of Open-Sora 1.2 (37.8k H100 hours), while producing higher-quality videos; inference takes 56 seconds for a 5-second, 384p clip.
- On VBench and EvalCrafter benchmarks, the model surpasses all open-source text-to-video baselines, achieving a quality score of 84.74 (vs. 84.11 for Gen-3 Alpha) and top-2 motion AC and dynamic degree scores, despite training only on public data and using a 2B-parameter model initialized from SD3-Medium.
- User study confirms superior preference in motion smoothness and aesthetic quality over CogVideoX and Kling, attributed to autoregressive generation at 24 fps (vs. 8 fps in baselines), enabling longer, higher-fidelity videos.
- Ablation studies validate the spatial and temporal pyramids: pyramidal flow matching achieves nearly three times faster FID convergence and significantly better visual and motion quality than full-sequence diffusion baselines under identical training conditions.
- The corrective noise in spatial pyramid inference eliminates blocky artifacts and enhances detail; blockwise causal attention ensures temporal coherence, outperforming bidirectional attention, which causes motion instability (see the mask sketch after this list).
- Qualitative results show cinematic-quality video generation, with performance comparable to Gen-3 Alpha and Kling and superior to CogVideoX-2B, even outperforming CogVideoX-5B on several metrics despite having less than half the parameters.
- Limitations include lack of support for non-autoregressive generation, subtle long-term subject inconsistency due to temporal pyramid compression, and challenges in face consistency and prompt fidelity, which can be addressed via improved data curation and compression methods.
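As referenced in the ablation bullet above, blockwise causal attention lets tokens attend bidirectionally within their own frame while seeing only past frames. A minimal sketch of such a mask (our own helper, with boolean `True` meaning "may attend"), usable as the `attn_mask` of `torch.nn.functional.scaled_dot_product_attention`:

```python
import torch

def blockwise_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean attention mask: full bidirectional attention within a frame,
    causal attention across frames (no access to future frames)."""
    total = num_frames * tokens_per_frame
    frame_id = torch.arange(total) // tokens_per_frame
    # mask[i, j] is True when query token i may attend to key token j,
    # i.e. when j's frame is not later than i's frame.
    return frame_id.unsqueeze(1) >= frame_id.unsqueeze(0)

# Example: 3 frames, 2 tokens each. Tokens of frame 0 attend only to frame 0;
# tokens of frame 2 attend to all six tokens.
mask = blockwise_causal_mask(3, 2)
```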
The authors use a pyramidal flow matching framework to achieve efficient video generation, significantly reducing computational and memory requirements compared to full-sequence diffusion. Results show that their model, trained only on publicly available data, surpasses open-source baselines on VBench and EvalCrafter, achieves a higher VBench quality score (84.74) and better motion smoothness than Gen-3 Alpha, and approaches the quality of commercial systems like Kling, particularly in motion smoothness and dynamic degree.

The authors use a user study to compare their model against several baselines, including Open-Sora, CogVideoX, and Kling, evaluating aesthetic quality, motion smoothness, and semantic alignment. Results show that their model is preferred over open-source models like Open-Sora and CogVideoX-2B, particularly in motion smoothness, due to the efficiency gains from pyramidal flow matching enabling higher frame rates. It also achieves competitive performance against commercial models like Kling and Gen-3 Alpha, especially in motion quality and visual aesthetics.
