HyperAI

Pyramid Flow Generates Ultra-high-definition Video Demo in One Minute

⚡️Pyramid Flow⚡️: Training-Efficient Autoregressive Video Generation Model Based on Flow Matching

1. Tutorial Introduction

Pyramid Flow is an open-source ultra-high-definition video generation model released in 2024 by a joint research team from Kuaishou, Peking University, and Beijing University of Posts and Telecommunications, described in the paper "Pyramidal Flow Matching for Efficient Video Generative Modeling". From a text description, the model can generate high-quality videos up to 10 seconds long, at resolutions up to 1280×768 and a frame rate of 24 fps. Its core technique, the pyramid flow matching algorithm, decomposes the video generation process into multiple stages at different resolutions, improving both generation efficiency and quality.
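To build intuition for why the multi-stage decomposition saves compute, here is an illustrative sketch (not the official implementation): most generation steps run at reduced resolutions, and only the final stage works at the full 1280×768 resolution. The three-stage halving ladder below is an assumption for illustration; the actual stage schedule is defined in the paper.

```python
# Illustrative sketch of a pyramid resolution ladder (hypothetical stage
# count and halving schedule; see the paper for the real design).

def resolution_ladder(width, height, stages=3):
    """Return [(w, h), ...] from coarsest to finest, halving each stage."""
    ladder = []
    for i in reversed(range(stages)):
        ladder.append((width >> i, height >> i))
    return ladder

ladder = resolution_ladder(1280, 768, stages=3)
print(ladder)  # [(320, 192), (640, 384), (1280, 768)]

# Pixels processed per stage, relative to always working at full resolution:
full = 1280 * 768
for w, h in ladder:
    print(f"{w}x{h}: {w * h / full:.2%} of full-resolution pixels")
```

The coarse stages touch only a small fraction of the pixels, which is where the training and inference efficiency comes from.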

This tutorial is based on this training-efficient autoregressive video generation method built on flow matching. Trained only on open-source datasets, it can generate high-quality 10-second videos at 768p resolution and 24 FPS, and it naturally supports image-to-video generation. The tutorial covers the following models and features:

Two model checkpoints:

  • 768p: Supports up to 10 seconds of video at 24 FPS
  • 384p: Supports up to 5 seconds of video at 24 FPS

Two functions:

  • Text-to-video generation (text_to_video)
  • Image to video generation (image_to_video)

2. Operation steps

After starting the container, click the API address to enter the web interface.

1. Text to video (text_to_video)

Select the Text-to-Video function, then enter a prompt and configure the settings described below.

  • prompt: A text prompt used as a guide for video generation. Note that it cannot exceed 128 characters.
  • Duration: The length of the generated video; this corresponds to the model's temp parameter. Duration=16 produces a 5-second clip, Duration=31 a 10-second clip.
  • guidance_scale: controls the visual quality. We recommend using guidance within [7, 9] for 768p checkpoints and within 7 for 384p checkpoints during text-to-video generation.
  • video_guidance_scale: Controls motion. Larger values increase dynamics and mitigate the quality degradation of autoregressive generation, while smaller values stabilize the video. For 10-second generation, we recommend a guidance scale of 7 and a video guidance scale of 5. In our tests, generating a 5-second video takes about 4 minutes with the 768p checkpoint (larger model) and about 2 minutes with the 384p checkpoint (smaller model).
Figure 1: Text-to-video demonstration
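The Duration values above look arbitrary, but they map cleanly to clip length under one assumption (not stated in this tutorial): Duration is the number of temporal latent frames, and the video VAE compresses time by a factor of 8, so the decoded clip has roughly (Duration − 1) × 8 + 1 frames at 24 FPS. A small sketch of that arithmetic:

```python
# Hedged sketch: relating the UI's Duration value to wall-clock length.
# ASSUMPTION: Duration counts temporal latent frames, and the VAE's
# temporal downsampling factor is 8 -> decoded frames = (Duration-1)*8 + 1.

FPS = 24
TEMPORAL_COMPRESSION = 8  # assumed VAE temporal downsampling factor

def approx_seconds(duration):
    """Approximate clip length in seconds for a given Duration value."""
    frames = (duration - 1) * TEMPORAL_COMPRESSION + 1
    return frames / FPS

print(f"Duration=16 -> ~{approx_seconds(16):.1f}s")  # ~5.0s
print(f"Duration=31 -> ~{approx_seconds(31):.1f}s")  # ~10.0s
```

This reproduces the 5 s and 10 s figures quoted above; treat the compression factor as an assumption if you check it against the model source.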

2. Image to video generation (image_to_video)

Select the Image-to-Video function, then enter a prompt and configure the settings described below.

  • input_image: Upload the source image.
  • prompt: A text prompt used as a guide for video generation. Note that it cannot exceed 128 characters.
  • Duration: The length of the generated video; this corresponds to the model's temp parameter. Duration=16 produces a 5-second clip, Duration=31 a 10-second clip.
  • video_guidance_scale: Controls motion. Larger values increase dynamics and mitigate the quality degradation of autoregressive generation, while smaller values stabilize the video. For 10-second generation, we recommend a guidance scale of 7 and a video guidance scale of 5. In our tests, generating a 5-second video takes about 3 minutes with the 768p checkpoint (larger model) and about 2 minutes with the 384p checkpoint (smaller model).
Figure 2: Image-to-video demonstration
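Both functions reject prompts longer than 128 characters, so it is worth checking the length before submitting. The helper below is a hypothetical convenience (not part of the web UI):

```python
# Hypothetical helper: enforce the 128-character prompt limit stated in
# this tutorial before submitting a generation request.

MAX_PROMPT_CHARS = 128

def check_prompt(prompt):
    """Return the prompt unchanged, or raise if it exceeds the limit."""
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(
            f"Prompt is {len(prompt)} characters; the limit is {MAX_PROMPT_CHARS}."
        )
    return prompt

check_prompt("A golden retriever running through shallow waves at sunset")  # OK
```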

Exchange and discussion

🖌️ If you come across a high-quality project, please leave us a message to recommend it! We have also set up a tutorial exchange group; scan the QR code below with the note [SD Tutorial] to join, discuss technical issues, and share your results ↓