Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation

Abstract

Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video of strong text-video correlation. After that, we propose a novel expert translation method that employs the latent-based VDMs to further upsample the low-resolution video to high resolution, which can also remove potential artifacts and corruptions from low-resolution videos. Compared to latent VDMs, Show-1 can produce high-quality videos of precise text-video alignment; compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15G vs 72G). Furthermore, our Show-1 model can be readily adapted for motion customization and video stylization applications through simple temporal attention layer finetuning. Our model achieves state-of-the-art performance on standard video generation benchmarks. Our code and model weights are publicly available at https://github.com/showlab/Show-1.
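The two-stage structure described above (a pixel-based VDM for a well-aligned low-resolution video, followed by a latent-based "expert translation" upsampler) can be sketched as follows. This is a minimal illustrative mock-up, not the authors' implementation: the class names `PixelVDM` and `LatentVDM`, and the placeholder generation and nearest-neighbour upsampling logic, are all hypothetical stand-ins; the real code is at https://github.com/showlab/Show-1.

```python
# Hedged sketch of Show-1's two-stage pipeline. All names and internals
# here are hypothetical placeholders for the real diffusion models.

class PixelVDM:
    """Stand-in for a pixel-space video diffusion model: produces a
    low-resolution video with strong text-video alignment."""
    def generate(self, prompt, frames=8, height=64, width=64):
        # Placeholder: return a frames x height x width grid of zeros.
        return [[[0.0] * width for _ in range(height)]
                for _ in range(frames)]

class LatentVDM:
    """Stand-in for the latent-space 'expert translation' stage that
    upsamples the low-res video to high resolution."""
    def upsample(self, video, scale=4):
        # Placeholder: nearest-neighbour upsampling of each frame;
        # the real model runs latent diffusion conditioned on the video.
        out = []
        for frame in video:
            rows = []
            for row in frame:
                wide = [px for px in row for _ in range(scale)]
                rows.extend([wide] * scale)
            out.append(rows)
        return out

def show1_pipeline(prompt):
    low_res = PixelVDM().generate(prompt)      # stage 1: pixel VDM
    high_res = LatentVDM().upsample(low_res)   # stage 2: latent VDM
    return low_res, high_res

low, high = show1_pipeline("a cat surfing a wave")
print(len(low), len(low[0]), len(low[0][0]))     # 8 64 64
print(len(high), len(high[0]), len(high[0][0]))  # 8 256 256
```

The design point the sketch captures is the division of labour: the cheap, text-faithful stage fixes content and motion at low resolution, and the latent stage only has to add spatial detail, which is where latent diffusion is efficient.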

