DIGAN (128x128, class-conditional) | 465 | 59.68 | 39.6 | Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks | |
MCVD (64x64) | 1143 | - | - | MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation | |
MAGVIT (AR) | 265 | - | - | MAGVIT: Masked Generative Video Transformer | |
PYoCo (Zero-shot, 64x64, text-conditional) | 355.19 | 47.76 | - | Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models | |
LVDM (256x256, unconditional) | 552 | - | 42 | Latent Video Diffusion Models for High-Fidelity Long Video Generation | |
TATS (128x128, class-conditional) | 332 | 79.28 | - | Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | |
PixelDance (256x256, text-conditional) | 242.82 | 42.10 | - | Make Pixels Dance: High-Dynamic Video Generation | |
ACDiT | 90 | - | - | ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer | |
Lumiere (Zero-shot, 1024x1024, text-conditional) | 332.49 | 37.54 | - | Lumiere: A Space-Time Diffusion Model for Video Generation | |
MMVG (128x128, class-conditional) | 328 | 73.7 | - | Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation | |
MAGVIT (-L-CG, 128x128, class-conditional) | 76±2 | 89.27±0.15 | - | MAGVIT: Masked Generative Video Transformer | |
Make-A-Video (Zero-shot, 256x256, class-conditional) | 367.23 | 33 | - | Make-A-Video: Text-to-Video Generation without Text-Video Data | |
OmniTokenizer-AR | 191 | - | - | OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation | |
Make-A-Video (Finetuning, 256x256, class-conditional) | 81.25 | 82.55 | - | Make-A-Video: Text-to-Video Generation without Text-Video Data | |
GridDiff (Zero-shot) | 340.0 | 62.88 | - | Grid Diffusion Models for Text-to-Video Generation | |
VideoFusion (128x128, unconditional) | 220 | 72.22 | - | VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation | |
VDM | 1396 | - | 116 | Latent Video Diffusion Models for High-Fidelity Long Video Generation | |
VideoAssembler (Zero-shot, 256x256, class-conditional) | 346.84 | 48.01 | - | MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing | |
Video-LaVIT | 280.57 | 44.26 | - | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | |
MAGVIT-v2 | 58±3 | - | - | Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation | |