MT-VAE: Learning Motion Transformations to Generate Multimodal Human Dynamics

Long-term human motion can be represented as a series of motion modes (motion sequences that capture short-term temporal dynamics) with transitions between them. We leverage this structure and present a novel Motion Transformation Variational Auto-Encoder (MT-VAE) for learning motion sequence generation. Our model jointly learns a feature embedding for motion modes (from which the motion sequence can be reconstructed) and a feature transformation that represents the transition from one motion mode to the next. Our model is able to generate multiple diverse and plausible future motion sequences from the same input. We apply our approach to both facial and full-body motion, and demonstrate applications such as analogy-based motion transfer and video synthesis.
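
As a rough illustration of the idea of embedding a motion mode, transforming the embedding with a sampled latent code, and decoding several different futures from one input, the following toy sketch may help. It is not the model described here: random linear maps stand in for the learned networks, the additive update in `transform` is an assumption chosen for simplicity, and every name, dimension, and weight (`ToyMotionTransformer`, `FRAME_DIM`, `sample_futures`, and so on) is hypothetical.

```python
import random

# Hypothetical sizes: values per frame, frames per motion mode,
# embedding width, and width of the latent transformation code.
FRAME_DIM, MODE_LEN, EMBED_DIM, LATENT_DIM = 4, 3, 5, 2


def random_matrix(rows, cols, rng):
    """Random weights standing in for a trained network layer."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product on Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyMotionTransformer:
    """Toy stand-in: untrained linear maps, for illustration only."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        flat = FRAME_DIM * MODE_LEN
        self.enc = random_matrix(EMBED_DIM, flat, rng)  # motion mode -> embedding
        self.dec = random_matrix(flat, EMBED_DIM, rng)  # embedding -> motion mode
        # (embedding, latent code) -> change applied to the embedding
        self.trans = random_matrix(EMBED_DIM, EMBED_DIM + LATENT_DIM, rng)

    def encode(self, mode):
        flat = [value for frame in mode for value in frame]
        return matvec(self.enc, flat)

    def decode(self, embedding):
        flat = matvec(self.dec, embedding)
        return [flat[i:i + FRAME_DIM] for i in range(0, len(flat), FRAME_DIM)]

    def transform(self, embedding, z):
        # Assumed additive transition: next embedding = current + delta(current, z).
        delta = matvec(self.trans, embedding + z)
        return [e + d for e, d in zip(embedding, delta)]

    def sample_futures(self, mode, num_samples, seed=1):
        """Decode one future motion mode per sampled latent code z."""
        rng = random.Random(seed)
        embedding = self.encode(mode)
        futures = []
        for _ in range(num_samples):
            z = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
            futures.append(self.decode(self.transform(embedding, z)))
        return futures


# Usage: one observed motion mode, three sampled futures.
model = ToyMotionTransformer()
past_mode = [[0.1 * (i + j) for j in range(FRAME_DIM)] for i in range(MODE_LEN)]
futures = model.sample_futures(past_mode, num_samples=3)
```

Each sampled `z` yields a different transformed embedding and therefore a different decoded future, which is the sense in which a single input can map to multiple outputs. In a trained model the weights would be learned and the latent code would be regularized toward a prior, neither of which this sketch attempts.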