Deep Video Generation, Prediction and Completion of Human Action Sequences

Current deep learning results on video generation are limited, while there are only a few first results on video prediction and no relevant significant results on video completion. This is due to the severe ill-posedness inherent in these three problems. In this paper, we focus on human action videos and propose a general, two-stage deep framework to generate human action videos with no constraints or an arbitrary number of constraints, which uniformly addresses three problems: video generation given no input frames, video prediction given the first few frames, and video completion given the first and last frames. To make the problem tractable, in the first stage we train a deep generative model that generates a human pose sequence from random noise. In the second stage, a skeleton-to-image network is trained and then used to generate a human action video given the complete human pose sequence produced in the first stage. This two-stage strategy sidesteps the original ill-posed problems while producing, for the first time, high-quality video generation/prediction/completion results of much longer duration. We present quantitative and qualitative evaluations showing that our two-stage approach outperforms state-of-the-art methods in video generation, prediction and completion. Our video result demonstration can be viewed at https://iamacewhite.github.io/supp/index.html
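
The two-stage pipeline described above can be sketched as follows. This is a minimal illustrative outline, not the authors' implementation: the model stand-ins, the sequence length, the pose dimensionality, and the clamping-based conditioning are all assumptions introduced here to show how one noise-to-pose-to-frame pipeline can cover generation, prediction, and completion simply by fixing different subsets of poses.

```python
# Hedged sketch of the two-stage framework (illustrative names only, not
# the authors' code). Stage 1 maps random noise to a human pose sequence;
# stage 2 renders each pose into a video frame. Constraints (known poses
# at given time steps) are clamped after generation, so the same routine
# handles generation (no constraints), prediction (first few poses fixed),
# and completion (first and last poses fixed).
import random

T = 16          # sequence length (assumed)
POSE_DIM = 36   # e.g. 18 joints x (x, y) coordinates (assumed)

def pose_generator(noise):
    """Stand-in for the stage-1 deep generative model: noise -> pose sequence."""
    return [[n * 0.1 for _ in range(POSE_DIM)] for n in noise]

def skeleton_to_image(pose):
    """Stand-in for the stage-2 skeleton-to-image network: pose -> frame."""
    return [v * 2.0 for v in pose]  # placeholder "rendering"

def generate_video(constraints=None):
    """constraints: dict {frame_index: pose}. Empty for pure generation,
    first few indices for prediction, first and last for completion."""
    noise = [random.gauss(0.0, 1.0) for _ in range(T)]
    poses = pose_generator(noise)
    for t, pose in (constraints or {}).items():
        poses[t] = pose  # clamp known poses (simplified conditioning)
    return [skeleton_to_image(p) for p in poses]

# Prediction: condition on the first two poses; completion would also fix index T-1.
video = generate_video({0: [0.0] * POSE_DIM, 1: [0.0] * POSE_DIM})
```

In the actual framework the conditioning is learned rather than post-hoc clamping, but the sketch captures why the decomposition helps: the hard, ill-posed pixel-level problem is split into a low-dimensional pose-sequence problem and a per-frame rendering problem.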