HyperAI
Text-to-Video Generation on MSR-VTT
Metrics
CLIPSIM
FID
FVD
Results
Performance of the different models on this benchmark
| Model Name | CLIPSIM | FID | FVD | Paper Title | Repository |
|---|---|---|---|---|---|
| ModelScopeT2V | 0.2930 | 11.09 | 550 | ModelScope Text-to-Video Technical Report | |
| Video LDM | 0.2929 | - | - | Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | |
| TF-T2V | 0.2991 | 8.19 | 441 | A Recipe for Scaling up Text-to-Video Generation with Text-free Videos | |
| NUWA | 0.2439 | 47.68 | - | NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion | |
| PixelDance | 0.3125 | - | 381 | Make Pixels Dance: High-Dynamic Video Generation | - |
| Make-A-Video | 0.3049 | 13.17 | - | Make-A-Video: Text-to-Video Generation without Text-Video Data | |
| GODIVA | 0.2402 | - | - | GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions | |
| MMVG | 0.2644 | 23.4 | - | Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation | |
| CogVideo (English) | 0.2631 | 23.59 | - | Make-A-Video: Text-to-Video Generation without Text-Video Data | |
| Snap Video (512x288) | 0.2793 | - | 104.0 | Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis | - |
| VideoPoet | 0.3123 | - | 213 | VideoPoet: A Large Language Model for Zero-Shot Video Generation | |
| MagicVideo | - | 36.5 | 998 | MagicVideo: Efficient Video Generation With Latent Diffusion Models | - |
| HiGen | 0.2947 | 8.60 | 406 | Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation | |
| Video-LaVIT | 0.3012 | 11.27 | 188.36 | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | |
| CogVideo (Chinese) | 0.2614 | - | - | Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | |
| VideoComposer | 0.2932 | - | 580 | VideoComposer: Compositional Video Synthesis with Motion Controllability | |
| Show-1 | 0.3072 | 13.08 | 538 | Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation | |
| Snap Video (288×288) | 0.2793 | - | 110.4 | Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis | - |
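
The leaderboard reports three metrics that do not all point the same way, so a single sort order is misleading. The sketch below ranks a subset of the listed models per metric, assuming the usual convention that higher CLIPSIM is better and lower FID and FVD are better. The `ROWS` list, the `METRICS` mapping, and the `rank` helper are illustrative names, not part of the site; the values are copied from the results above, with `None` standing in for a "-" entry.

```python
# (model, CLIPSIM, FID, FVD); None marks a metric the leaderboard lists as "-".
ROWS = [
    ("ModelScopeT2V", 0.2930, 11.09, 550.0),
    ("TF-T2V", 0.2991, 8.19, 441.0),
    ("PixelDance", 0.3125, None, 381.0),
    ("Make-A-Video", 0.3049, 13.17, None),
    ("Snap Video (512x288)", 0.2793, None, 104.0),
    ("VideoPoet", 0.3123, None, 213.0),
    ("HiGen", 0.2947, 8.60, 406.0),
    ("Video-LaVIT", 0.3012, 11.27, 188.36),
    ("Show-1", 0.3072, 13.08, 538.0),
]

# metric name -> (column index in ROWS, whether a higher value is better)
# Assumption: CLIPSIM is a similarity score (higher is better);
# FID and FVD are distances (lower is better).
METRICS = {"CLIPSIM": (1, True), "FID": (2, False), "FVD": (3, False)}


def rank(metric: str) -> list[str]:
    """Return model names ordered best-first, skipping models with no score."""
    col, higher_is_better = METRICS[metric]
    scored = [(row[col], row[0]) for row in ROWS if row[col] is not None]
    scored.sort(key=lambda pair: pair[0], reverse=higher_is_better)
    return [name for _, name in scored]


if __name__ == "__main__":
    for metric in METRICS:
        print(metric, rank(metric)[:3])
```

Models without a reported value are left out of that metric's ranking rather than being placed last, since a missing score says nothing about quality.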