HyperAI

Text-to-Video Generation on MSR-VTT

Metrics

CLIPSIM
FID
FVD

Results

Performance results of various models on this benchmark

| Model name | CLIPSIM | FID | FVD | Paper title | Repository |
|---|---|---|---|---|---|
| ModelScopeT2V | 0.2930 | 11.09 | 550 | ModelScope Text-to-Video Technical Report | |
| Video LDM | 0.2929 | - | - | Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | |
| TF-T2V | 0.2991 | 8.19 | 441 | A Recipe for Scaling up Text-to-Video Generation with Text-free Videos | |
| NUWA | 0.2439 | 47.68 | - | NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion | |
| PixelDance | 0.3125 | - | 381 | Make Pixels Dance: High-Dynamic Video Generation | - |
| Make-A-Video | 0.3049 | 13.17 | - | Make-A-Video: Text-to-Video Generation without Text-Video Data | |
| GODIVA | 0.2402 | - | - | GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions | |
| MMVG | 0.2644 | 23.4 | - | Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation | |
| CogVideo (English) | 0.2631 | 23.59 | - | Make-A-Video: Text-to-Video Generation without Text-Video Data | |
| Snap Video (512×288) | 0.2793 | - | 104.0 | Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis | - |
| VideoPoet | 0.3123 | - | 213 | VideoPoet: A Large Language Model for Zero-Shot Video Generation | |
| MagicVideo | - | 36.5 | 998 | MagicVideo: Efficient Video Generation With Latent Diffusion Models | - |
| HiGen | 0.2947 | 8.60 | 406 | Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation | |
| Video-LaVIT | 0.3012 | 11.27 | 188.36 | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | |
| CogVideo (Chinese) | 0.2614 | - | - | Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | |
| VideoComposer | 0.2932 | - | 580 | VideoComposer: Compositional Video Synthesis with Motion Controllability | |
| Show-1 | 0.3072 | 13.08 | 538 | Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation | |
| Snap Video (288×288) | 0.2793 | - | 110.4 | Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis | - |
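
As a reading aid, the results can be ranked per metric. The Python sketch below is illustrative only. It assumes the usual conventions that a higher CLIPSIM is better and that lower FID and FVD are better, copies its values from the results above, and uses hypothetical names (`ROWS`, `METRICS`, `rank`) that are not part of the leaderboard.

```python
# Values copied from the results above; None marks a metric that is not reported.
ROWS = [
    # (model, CLIPSIM, FID, FVD)
    ("ModelScopeT2V", 0.2930, 11.09, 550.0),
    ("Video LDM", 0.2929, None, None),
    ("TF-T2V", 0.2991, 8.19, 441.0),
    ("NUWA", 0.2439, 47.68, None),
    ("PixelDance", 0.3125, None, 381.0),
    ("Make-A-Video", 0.3049, 13.17, None),
    ("GODIVA", 0.2402, None, None),
    ("MMVG", 0.2644, 23.4, None),
    ("CogVideo (English)", 0.2631, 23.59, None),
    ("Snap Video (512x288)", 0.2793, None, 104.0),
    ("VideoPoet", 0.3123, None, 213.0),
    ("MagicVideo", None, 36.5, 998.0),
    ("HiGen", 0.2947, 8.60, 406.0),
    ("Video-LaVIT", 0.3012, 11.27, 188.36),
    ("CogVideo (Chinese)", 0.2614, None, None),
    ("VideoComposer", 0.2932, None, 580.0),
    ("Show-1", 0.3072, 13.08, 538.0),
    ("Snap Video (288x288)", 0.2793, None, 110.4),
]

# Column index and direction per metric (assumption: CLIPSIM is a similarity
# score, so higher is better; FID and FVD are distances, so lower is better).
METRICS = {"CLIPSIM": (1, True), "FID": (2, False), "FVD": (3, False)}


def rank(rows, metric):
    """Return (model, value) pairs sorted best-first, skipping unreported values."""
    col, higher_is_better = METRICS[metric]
    reported = [(row[0], row[col]) for row in rows if row[col] is not None]
    return sorted(reported, key=lambda pair: pair[1], reverse=higher_is_better)


if __name__ == "__main__":
    for metric in METRICS:
        best_model, best_value = rank(ROWS, metric)[0]
        print(f"{metric}: {best_model} ({best_value})")
```

Models that do not report a metric are left out of that metric's ranking rather than being placed last, so each ranking covers a different subset of the 18 entries.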