Jihwan Kim, Junoh Kang, Jinyoung Choi, Bohyung Han

Abstract
We propose a novel inference technique based on a pretrained diffusion model for text-conditional video generation. Our approach, called FIFO-Diffusion, is conceptually capable of generating infinitely long videos without training. This is achieved by iteratively performing diagonal denoising, which concurrently processes a series of consecutive frames with increasing noise levels in a queue; our method dequeues a fully denoised frame at the head while enqueuing a new random noise frame at the tail. However, diagonal denoising is a double-edged sword: the frames near the tail can take advantage of cleaner ones by forward reference, but such a strategy induces a discrepancy between training and inference. Hence, we introduce latent partitioning to reduce the training-inference gap and lookahead denoising to leverage the benefit of forward referencing. We demonstrate promising results and the effectiveness of the proposed methods on existing text-to-video generation baselines.
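The queue mechanics of diagonal denoising can be sketched as follows. This is a minimal illustration, not the authors' implementation: `denoise_step` is a placeholder for one step of a pretrained video diffusion model (which in practice denoises all queued frames jointly), and latents are stand-in scalars rather than tensors.

```python
from collections import deque
import random

def denoise_step(latent, noise_level):
    # Placeholder for one denoising step of a pretrained diffusion model.
    return latent * 0.9

def fifo_diffusion(num_output_frames, queue_len=16):
    """Sketch of FIFO-Diffusion's diagonal denoising loop.

    The queue holds `queue_len` consecutive latent frames whose noise
    levels increase from head (nearly clean) to tail (pure noise), so a
    single pass advances every frame by one denoising step.
    """
    # noise_level ranges over [0, queue_len]; 0 means fully denoised.
    queue = deque(
        {"latent": random.random(), "noise_level": level + 1}
        for level in range(queue_len)
    )
    outputs = []
    while len(outputs) < num_output_frames:
        # One diagonal denoising step: every queued frame advances one level.
        for frame in queue:
            frame["latent"] = denoise_step(frame["latent"], frame["noise_level"])
            frame["noise_level"] -= 1
        # Dequeue the fully denoised frame at the head ...
        outputs.append(queue.popleft()["latent"])
        # ... and enqueue fresh random noise at the tail, so generation
        # can continue indefinitely.
        queue.append({"latent": random.random(), "noise_level": queue_len})
    return outputs
```

Because the queue is refilled with fresh noise on every iteration, the loop can run for arbitrarily many output frames, which is what makes the method conceptually capable of generating infinitely long videos.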
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| video-generation-on-ucf-101 | FIFO-Diffusion | FVD128: 596.64; Inception Score: 74.44 |