HyperAI

Video Generation on UCF-101

Metrics

FVD16
Inception Score
KVD16
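
The page lists these metrics without defining them. As a rough sketch, and assuming the usual conventions (FVD is a Fréchet distance between Gaussians fitted to video features of real and generated clips, the "16" suffix refers to 16-frame clips, and Inception Score is the exponentiated mean KL divergence between per-sample class predictions and their marginal), the core arithmetic could look like the code below. The function names are hypothetical, and the feature extractor and classifier that a real evaluation needs are not shown; the inputs here are plain arrays.

```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two (n_samples, dim) arrays, dim >= 2."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # tr((cov_r cov_g)^(1/2)) is the sum of square roots of the eigenvalues of cov_r @ cov_g.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


def inception_score(probs, eps=1e-12):
    """Inception Score from an (n_samples, n_classes) array of class probabilities."""
    marginal = probs.mean(axis=0, keepdims=True)
    # Per-sample KL divergence between the prediction and the marginal distribution.
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 8))            # stand-in for real-video features
    fake = rng.normal(loc=0.5, size=(500, 8))   # stand-in for generated-video features
    print("Frechet distance:", frechet_distance(real, fake))
    print("Inception Score:", inception_score(rng.dirichlet(np.ones(101), size=500)))
```

Lower is better for the Fréchet distance and higher is better for Inception Score. KVD16 is not sketched here; it is usually described as a kernel-based (maximum mean discrepancy) counterpart computed on the same features.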

Results

Performance results of various models on this benchmark

| Model Name | FVD16 | Inception Score | KVD16 | Paper Title | Repository |
|---|---|---|---|---|---|
| DIGAN (128x128, class-conditional) | 465 | 59.68 | 39.6 | Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks | |
| MCVD (64x64) | 1143 | - | - | MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation | |
| MAGVIT (AR) | 265 | - | - | MAGVIT: Masked Generative Video Transformer | |
| PYoCo (Zero-shot, 64x64, text-conditional) | 355.19 | 47.76 | - | Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models | - |
| LVDM (256x256, unconditional) | 552 | - | 42 | Latent Video Diffusion Models for High-Fidelity Long Video Generation | |
| TATS (128x128, class-conditional) | 332 | 79.28 | - | Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | |
| PixelDance (256x256, text-conditional) | 242.82 | 42.10 | - | Make Pixels Dance: High-Dynamic Video Generation | - |
| ACDiT | 90 | - | - | ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer | |
| Lumiere (Zero-shot, 1024x1024, text-conditional) | 332.49 | 37.54 | - | Lumiere: A Space-Time Diffusion Model for Video Generation | |
| MMVG (128x128, class-conditional) | 328 | 73.7 | - | Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation | |
| MAGVIT (-L-CG, 128x128, class-conditional) | 76±2 | 89.27±0.15 | - | MAGVIT: Masked Generative Video Transformer | |
| Make-A-Video (Zero-shot, 256x256, class-conditional) | 367.23 | 33 | - | Make-A-Video: Text-to-Video Generation without Text-Video Data | |
| OmniTokenizer-AR | 191 | - | - | OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation | |
| Make-A-Video (Finetuning, 256x256, class-conditional) | 81.25 | 82.55 | - | Make-A-Video: Text-to-Video Generation without Text-Video Data | |
| GridDiff (Zero-shot) | 340.062.88 (fused with next column in source) | (unclear) | - | Grid Diffusion Models for Text-to-Video Generation | - |
| VideoFusion (128x128, unconditional) | 220 | 72.22 | - | VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation | |
| VDM | 1396 | - | 116 | Latent Video Diffusion Models for High-Fidelity Long Video Generation | |
| VideoAssembler (Zero-shot, 256x256, class-conditional) | 346.84 | 48.01 | - | MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing | |
| Video-LaVIT | 280.57 | 44.26 | - | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | |
| MAGVIT-v2 | 58±3 | - | - | Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation | |