Text-to-Video Generation on MSR-VTT
Metrics
CLIPSIM
FID
FVD
Results
Performance of various models on this benchmark
| Model name | CLIPSIM | FID | FVD | Paper title | Repository |
| --- | --- | --- | --- | --- | --- |
| ModelScopeT2V | 0.2930 | 11.09 | 550 | ModelScope Text-to-Video Technical Report | - |
| Video LDM | 0.2929 | - | - | Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | - |
| TF-T2V | 0.2991 | 8.19 | 441 | A Recipe for Scaling up Text-to-Video Generation with Text-free Videos | - |
| NUWA | 0.2439 | 47.68 | - | NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion | - |
| PixelDance | 0.3125 | - | 381 | Make Pixels Dance: High-Dynamic Video Generation | - |
| Make-A-Video | 0.3049 | 13.17 | - | Make-A-Video: Text-to-Video Generation without Text-Video Data | - |
| GODIVA | 0.2402 | - | - | GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions | - |
| MMVG | 0.2644 | 23.4 | - | Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation | - |
| CogVideo (English) | 0.2631 | 23.59 | - | Make-A-Video: Text-to-Video Generation without Text-Video Data | - |
| Snap Video (512×288) | 0.2793 | - | 104.0 | Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis | - |
| VideoPoet | 0.3123 | - | 213 | VideoPoet: A Large Language Model for Zero-Shot Video Generation | - |
| MagicVideo | - | 36.5 | 998 | MagicVideo: Efficient Video Generation With Latent Diffusion Models | - |
| HiGen | 0.2947 | 8.60 | 406 | Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation | - |
| Video-LaVIT | 0.3012 | 11.27 | 188.36 | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | - |
| CogVideo (Chinese) | 0.2614 | - | - | Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | - |
| VideoComposer | 0.2932 | - | 580 | VideoComposer: Compositional Video Synthesis with Motion Controllability | - |
| Show-1 | 0.3072 | 13.08 | 538 | Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation | - |
| Snap Video (288×288) | 0.2793 | - | 110.4 | Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis | - |
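The results above are listed in no particular order, and several entries are missing one or more metrics. As a rough illustration, the sketch below ranks the models on each metric while skipping missing values. It assumes the usual reading of these metrics: a higher CLIPSIM is better, while lower FID and FVD are better. The names `ROWS`, `METRICS`, and `rank` are made up for this example and are not part of the page; the numbers are copied from the results listed above.

```python
# (model, CLIPSIM, FID, FVD) copied from the results above; None marks a "-" entry.
ROWS = [
    ("ModelScopeT2V", 0.2930, 11.09, 550.0),
    ("Video LDM", 0.2929, None, None),
    ("TF-T2V", 0.2991, 8.19, 441.0),
    ("NUWA", 0.2439, 47.68, None),
    ("PixelDance", 0.3125, None, 381.0),
    ("Make-A-Video", 0.3049, 13.17, None),
    ("GODIVA", 0.2402, None, None),
    ("MMVG", 0.2644, 23.4, None),
    ("CogVideo (English)", 0.2631, 23.59, None),
    ("Snap Video (512×288)", 0.2793, None, 104.0),
    ("VideoPoet", 0.3123, None, 213.0),
    ("MagicVideo", None, 36.5, 998.0),
    ("HiGen", 0.2947, 8.60, 406.0),
    ("Video-LaVIT", 0.3012, 11.27, 188.36),
    ("CogVideo (Chinese)", 0.2614, None, None),
    ("VideoComposer", 0.2932, None, 580.0),
    ("Show-1", 0.3072, 13.08, 538.0),
    ("Snap Video (288×288)", 0.2793, None, 110.4),
]

# metric name -> (column index in ROWS, whether a higher value is better).
# The direction flags are an assumption based on how these metrics are usually read.
METRICS = {"CLIPSIM": (1, True), "FID": (2, False), "FVD": (3, False)}


def rank(metric):
    """Return (model, value) pairs sorted best-first, skipping models without a value."""
    col, higher_is_better = METRICS[metric]
    scored = [(row[0], row[col]) for row in ROWS if row[col] is not None]
    return sorted(scored, key=lambda pair: pair[1], reverse=higher_is_better)


if __name__ == "__main__":
    for metric in METRICS:
        ranking = rank(metric)
        best_name, best_value = ranking[0]
        print(f"{metric}: {len(ranking)} models reported, best is {best_name} ({best_value})")
```

Dropping missing entries rather than treating them as zero keeps a model from looking artificially good on FID or FVD, where lower values rank higher. Because each metric is reported by a different subset of models, the three rankings cover different sets and are not directly interchangeable.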