
Pyramidal Flow Matching for Efficient Video Generative Modeling

Abstract

Video generation requires modeling a vast spatiotemporal space, which demands significant computational resources and data usage. To reduce the complexity, the prevailing approaches employ a cascaded architecture to avoid direct training with full-resolution latents. Despite reducing computational demands, the separate optimization of each sub-stage hinders knowledge sharing and sacrifices flexibility. This work introduces a unified pyramidal flow matching algorithm. It reinterprets the original denoising trajectory as a series of pyramid stages, where only the final stage operates at the full resolution, thereby enabling more efficient video generative modeling. Through our sophisticated design, the flows of different pyramid stages can be interlinked to maintain continuity. Moreover, we craft autoregressive video generation with a temporal pyramid to compress the full-resolution history. The entire framework can be optimized in an end-to-end manner and with a single unified Diffusion Transformer (DiT). Extensive experiments demonstrate that our method supports generating high-quality 5-second (up to 10-second) videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours. All code and models are open-sourced at https://pyramid-flow.github.io.

One-sentence Summary

The authors from Peking University, Kuaishou Technology, Beijing University of Posts and Telecommunications, and Pazhou Laboratory propose a unified pyramidal flow matching framework that reinterprets video generation as a multi-stage process, enabling end-to-end training with a single Diffusion Transformer while reducing computational cost through hierarchical flow interlinking and autoregressive temporal pyramid compression, achieving high-quality 768p 24 FPS video synthesis up to 10 seconds within 20.7k A100 GPU hours.

Key Contributions

  • The paper addresses the high computational cost of video generation by introducing a unified pyramidal flow matching framework that reinterprets the denoising process as a sequence of progressively refined stages, operating at compressed resolutions until the final full-resolution output, thereby reducing redundant computation during early, noisy timesteps.

  • It proposes a novel pyramidal flow matching algorithm that integrates spatial and temporal pyramid representations within a single Diffusion Transformer, enabling end-to-end training with interlinked generation trajectories and autoregressive generation conditioned on compressed temporal history, significantly improving training efficiency.

  • The method achieves state-of-the-art video generation quality on 768p at 24 FPS for up to 10-second videos, with training completed in just 20.7k A100 GPU hours on public datasets, and demonstrates competitive performance on VBench and EvalCrafter benchmarks.

Introduction

Video generation using diffusion and autoregressive models has achieved impressive results in realism and duration, but remains computationally expensive due to the need to model large spatiotemporal spaces. Prior approaches often rely on cascaded architectures that generate video in stages—first at low resolution and then upsampled with separate super-resolution models. While this reduces per-stage computation, it introduces inefficiencies through disjoint model training, limited knowledge sharing, and lack of scalability. The authors introduce pyramidal flow matching, a unified framework that reinterprets video generation as a multi-scale process operating across spatial and temporal pyramids. By compressing latent representations at earlier stages and progressively decompressing them, the method reduces the number of tokens during training—cutting computational load by up to 87% for 10-second videos—while maintaining high-quality output. The key innovation is a single Diffusion Transformer trained end-to-end with a unified flow matching objective, enabling simultaneous generation and decompression across pyramid levels. This approach eliminates the need for separate models, supports efficient autoregressive conditioning via compressed temporal history, and achieves competitive performance on benchmark datasets without requiring proprietary data.

Method

The authors leverage a unified pyramidal flow matching framework to address the computational challenges of video generation by reinterpreting the denoising trajectory as a series of spatial pyramid stages. This approach avoids the need for separate models at each resolution, enabling knowledge sharing and a more efficient training process. The core of the method is a spatial pyramid that divides the generation process into multiple stages, each operating at a progressively higher resolution. As shown in the framework diagram, the initial stages begin from a pixelated and noisy latent representation, which is progressively refined through a series of upsampling and denoising steps. The final stage operates at the full resolution, ensuring high-quality output. This design significantly reduces computational cost, as most stages are performed at lower resolutions, with the overall complexity reduced to nearly $1/K$ of the full-resolution cost, where $K$ is the number of stages.
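
To make the cost argument concrete, the sketch below tallies per-stage token counts. It is a back-of-the-envelope illustration rather than the authors' code: the latent grid size, the frame count, and the assumption that each stage halves the height and width are all illustrative choices.

```python
def stage_tokens(h: int, w: int, frames: int, num_stages: int) -> list[int]:
    """Tokens per stage; stage k is downsampled by 2^k per side, so
    k = 0 is full resolution and k = num_stages - 1 is the coarsest."""
    return [(h >> k) * (w >> k) * frames for k in range(num_stages)]

if __name__ == "__main__":
    K = 3
    toks = stage_tokens(h=96, w=96, frames=16, num_stages=K)  # assumed latent grid
    naive = toks[0] * K                      # all K stages at full resolution
    print(toks, f"relative cost: {sum(toks) / naive:.2f}")    # ~0.44 for K = 3
```

Since attention cost grows quadratically with token count, the savings realized at the coarse stages are even larger than this linear tally suggests.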

The unified modeling of the pyramid stages is achieved by defining a conditional probability path that interpolates between different noise levels and resolutions. For each stage $k$, the path is defined by a start point $\hat{x}_{s_k}$ and an end point $\hat{x}_{e_k}$, both derived from the target data $x_1$ using downsampling and upsampling functions. To ensure the flow trajectory is straight and the model learns a consistent path, the authors couple the sampling of these endpoints by using the same noise vector $\boldsymbol{n}$. This results in a unified flow matching objective that regresses the model's predicted velocity $v_t$ against the difference between the end and start points, $\hat{x}_{e_k} - \hat{x}_{s_k}$, across all stages. This single objective allows a single model to simultaneously learn the generation and decompression tasks, facilitating knowledge sharing between stages.
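
As a concrete illustration, here is a minimal sketch of one training step under this objective. The velocity-prediction model signature `model(x_t, t, k)`, the linear per-stage noise levels, and the bilinear down/upsampling are all assumptions made for readability, not the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def pyramid_flow_loss(model, x1, k, num_stages):
    """Coupled-endpoint flow matching loss for pyramid stage k.
    Stage indices follow the text: k = 0 is the full-resolution stage and
    k = num_stages - 1 the coarsest. x1: clean latent, shape (B, C, H, W)."""
    B = x1.shape[0]
    down = 2 ** k                                  # stage-k downsampling factor

    # Stage-k target (the content of \hat{x}_{e_k}) and a blurrier
    # down-then-upsampled proxy (the content of \hat{x}_{s_k}).
    x_hi = F.interpolate(x1, scale_factor=1 / down, mode="bilinear") if down > 1 else x1
    x_lo = F.interpolate(F.interpolate(x_hi, scale_factor=0.5, mode="bilinear"),
                         scale_factor=2.0, mode="bilinear")

    # The SAME noise vector n enters both endpoints, keeping the path straight.
    n = torch.randn_like(x_hi)
    s_k = (k + 1) / num_stages                     # assumed linear noise levels;
    e_k = k / num_stages                           # the coarsest stage starts at pure noise
    x_start = (1 - s_k) * x_lo + s_k * n           # \hat{x}_{s_k}
    x_end = (1 - e_k) * x_hi + e_k * n             # \hat{x}_{e_k}

    # Regress the velocity against the endpoint difference at a random time t.
    t = torch.rand(B, device=x1.device).view(B, 1, 1, 1)
    x_t = (1 - t) * x_start + t * x_end
    v_pred = model(x_t, t.flatten(), k)            # assumed model signature
    return F.mse_loss(v_pred, x_end - x_start)
```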

During inference, the model samples from the probability path within each pyramid stage. However, a critical challenge arises at the jump points between stages of different resolutions, where the probability path must remain continuous. To address this, the authors introduce a renoising scheme that handles these jump points. The process begins by upsampling the previous stage's endpoint, $\hat{x}_{e_{k+1}}$, to the current stage's resolution. The resulting distribution is then corrected with a small amount of noise to match the desired distribution of the start point $\hat{x}_{s_k}$. This corrective noise is designed to decorrelate the upsampled features, and the authors derive a specific update rule that ensures the means and covariances of the distributions are matched. This renoising step is crucial for maintaining the continuity of the probability path and ensuring a smooth transition between stages.
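
A minimal sketch of this jump-point transition is shown below. The paper derives exact coefficients that match the means and covariances of the two endpoint distributions; here we assume, for illustration only, that bilinear upsampling attenuates the noise component by a factor `rho` and blend in just enough fresh noise to restore the marginal noise level.

```python
import torch
import torch.nn.functional as F

def jump_point_renoise(x_end_coarse, s_k, rho=0.75):
    r"""Transition from \hat{x}_{e_{k+1}} (the coarser stage's endpoint) to the
    start point \hat{x}_{s_k} of the next, higher-resolution stage.

    x_end_coarse: (B, C, H, W) latent at the coarser resolution.
    s_k: noise level (noise std) at the start of stage k.
    rho: assumed attenuation of the noise component caused by upsampling.
    """
    # 1) Upsample the previous stage's endpoint to the current resolution.
    x_up = F.interpolate(x_end_coarse, scale_factor=2.0, mode="bilinear")

    # 2) Upsampling correlates (and shrinks) the noise in neighboring pixels,
    #    so fresh noise is added to decorrelate the features and restore the
    #    noise std from rho * s_k back to s_k (variance matching:
    #    (rho * s_k)^2 + sigma^2 = s_k^2).
    sigma = s_k * (1.0 - rho ** 2) ** 0.5
    return x_up + sigma * torch.randn_like(x_up)
```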

To further improve training efficiency, the authors introduce a temporal pyramid design for autoregressive video generation. This design reduces the computational redundancy in the full-resolution history condition by using compressed, lower-resolution history frames as the input for the current frame prediction. As illustrated in the diagram, at each pyramid stage, the generation is conditioned on a history of past frames that have been progressively downsampled. This significantly reduces the number of tokens required for training, as most frames are processed at the lowest resolution. The position encoding is designed to be compatible with this pyramid structure, extrapolating in the spatial dimension to preserve fine-grained details and interpolating in the temporal dimension to ensure spatial alignment of the history conditions. This combined spatial and temporal pyramid design allows the model to be trained efficiently on long videos while maintaining high quality.
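
A sketch of how such a compressed history condition might be assembled is given below. The age-based downsampling schedule and the bilinear interpolation are illustrative assumptions; the paper's temporal pyramid follows its own schedule over autoencoder latents.

```python
import torch
import torch.nn.functional as F

def compress_history(history, num_levels=3):
    """Build a temporal-pyramid condition from past latent frames.

    history: list of (C, H, W) tensors, oldest first. Recent frames are kept
    near full resolution, while older frames are progressively downsampled,
    so most of the history contributes only a few tokens.
    """
    compressed = []
    n = len(history)
    for i, frame in enumerate(history):
        age = n - 1 - i                          # 0 = most recent frame
        level = min(age, num_levels - 1)         # assumed age-to-level mapping
        if level > 0:
            frame = F.interpolate(frame[None], scale_factor=0.5 ** level,
                                  mode="bilinear")[0]
        compressed.append(frame)
    # Each frame is later flattened into tokens and concatenated as the
    # autoregressive condition for predicting the next frame.
    return compressed
```

Because most history frames contribute tokens only at the lowest resolutions, the conditioning cost grows slowly with video length.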

Experiment

  • The proposed pyramidal flow matching framework reduces computational and memory overhead in video generation training by up to 16^K, enabling a 10-second, 241-frame video model to be trained in only 20.7k A100 GPU hours, significantly less than Open-Sora 1.2, which requires over 40k GPU hours for fewer frames and lower quality.
  • On VBench and EvalCrafter benchmarks, the model achieves state-of-the-art performance among open-source methods, surpassing CogVideoX-5B (twice the model size) in quality and motion smoothness (84.74 vs. 84.11 on VBench), and outperforming Gen-3 Alpha in key metrics despite training on public data only.
  • User studies confirm superior preference in aesthetic quality, motion smoothness, and semantic alignment over open-source baselines like CogVideoX and Open-Sora, particularly due to support for 24 fps generation (vs. 8 fps in baselines).
  • Ablation studies validate the effectiveness of spatial and temporal pyramids: the spatial pyramid accelerates FID convergence by nearly three times, while the temporal pyramid enables stable, coherent video generation where full-sequence diffusion fails to converge.
  • The model demonstrates strong image-to-video generation capability via autoregressive inference, producing 5-second 768p videos with rich motion dynamics, and it generates high-quality images even with only a few million training samples.
  • Corrective noise in spatial pyramid inference and blockwise causal attention are critical for reducing artifacts and ensuring temporal coherence, with ablations showing significant degradation in visual quality and motion consistency without them.
  • Despite limitations in long-term subject consistency and lack of non-autoregressive generation, the method achieves cinematic-quality video generation with competitive performance against commercial models using a fraction of the training cost.

The authors use a pyramidal flow matching framework to achieve efficient video generation, significantly reducing computational and memory requirements compared to full-sequence diffusion. Results show that their model, trained only on public datasets, outperforms several open-source baselines on VBench and EvalCrafter, achieving a quality score of 84.74 and motion smoothness of 99.12 on VBench, and approaches the quality of commercial systems such as Gen-3 Alpha and Kling, particularly in motion smoothness and dynamic degree.

The authors use a user study to compare their model against several baselines, including Open-Sora, CogVideoX, and Kling, on aesthetic quality, motion smoothness, and semantic alignment. Results show that their model is preferred over open-source models like Open-Sora and CogVideoX-2B, particularly in motion smoothness, due to the efficiency gains from pyramidal flow matching enabling higher frame rates. It also achieves competitive performance with commercial models like Kling and Gen-3 Alpha, especially in motion quality and visual aesthetics.

