HyperAI

Text-to-Video Generation on MSR-VTT

Metrics

CLIPSIM
FID
FVD

Results

Performance results of various models on this benchmark

| Model name | CLIPSIM | FID | FVD | Paper Title | Repository |
|---|---|---|---|---|---|
| ModelScopeT2V | 0.2930 | 11.09 | 550 | ModelScope Text-to-Video Technical Report | - |
| Video LDM | 0.2929 | - | - | Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | - |
| TF-T2V | 0.2991 | 8.19 | 441 | A Recipe for Scaling up Text-to-Video Generation with Text-free Videos | - |
| NUWA | 0.2439 | 47.68 | - | NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion | - |
| PixelDance | 0.3125 | - | 381 | Make Pixels Dance: High-Dynamic Video Generation | - |
| Make-A-Video | 0.3049 | 13.17 | - | Make-A-Video: Text-to-Video Generation without Text-Video Data | - |
| GODIVA | 0.2402 | - | - | GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions | - |
| MMVG | 0.2644 | 23.4 | - | Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation | - |
| CogVideo (English) | 0.2631 | 23.59 | - | Make-A-Video: Text-to-Video Generation without Text-Video Data | - |
| Snap Video (512×288) | 0.2793 | - | 104.0 | Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis | - |
| VideoPoet | 0.3123 | - | 213 | VideoPoet: A Large Language Model for Zero-Shot Video Generation | - |
| MagicVideo | - | 36.5 | 998 | MagicVideo: Efficient Video Generation With Latent Diffusion Models | - |
| HiGen | 0.2947 | 8.60 | 406 | Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation | - |
| Video-LaVIT | 0.3012 | 11.27 | 188.36 | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | - |
| CogVideo (Chinese) | 0.2614 | - | - | Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | - |
| VideoComposer | 0.2932 | - | 580 | VideoComposer: Compositional Video Synthesis with Motion Controllability | - |
| Show-1 | 0.3072 | 13.08 | 538 | Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation | - |
| Snap Video (288×288) | 0.2793 | - | 110.4 | Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis | - |