MotionBERT: A Unified Perspective on Learning Human Motion Representations

We present a unified perspective on tackling various human-centric video tasks by learning human motion representations from large-scale and heterogeneous data resources. Specifically, we propose a pretraining stage in which a motion encoder is trained to recover the underlying 3D motion from noisy, partial 2D observations. The motion representations acquired in this way incorporate geometric, kinematic, and physical knowledge about human motion, which can be easily transferred to multiple downstream tasks. We implement the motion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer) neural network. It captures long-range spatio-temporal relationships among the skeletal joints comprehensively and adaptively, as exemplified by the lowest 3D pose estimation error to date when trained from scratch. Furthermore, our proposed framework achieves state-of-the-art performance on all three downstream tasks by simply finetuning the pretrained motion encoder with a simple regression head (1-2 layers), which demonstrates the versatility of the learned motion representations. Code and models are available at https://motionbert.github.io/
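To make the pretraining recipe concrete, below is a minimal PyTorch sketch of the idea described above, not the paper's actual DSTformer or training code: corrupt 2D keypoint sequences to mimic noisy, partial observations, encode them, and regress the underlying 3D motion. All class names, tensor shapes, and hyperparameters here are illustrative assumptions; the official implementation is at https://motionbert.github.io/.

    # A minimal sketch of the 2D-to-3D pretraining objective, NOT the authors'
    # released code. TinyMotionEncoder is an assumed stand-in for the DSTformer.
    import torch
    import torch.nn as nn

    class TinyMotionEncoder(nn.Module):
        """Stand-in encoder over a (frames, joints) keypoint sequence."""
        def __init__(self, num_joints=17, dim=64):
            super().__init__()
            self.embed = nn.Linear(num_joints * 2, dim)  # flatten 2D joints per frame
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.head = nn.Linear(dim, num_joints * 3)   # simple regression head

        def forward(self, pose2d):                       # pose2d: (B, T, J, 2)
            B, T, J, _ = pose2d.shape
            h = self.encoder(self.embed(pose2d.reshape(B, T, J * 2)))
            return self.head(h).reshape(B, T, J, 3)      # predicted 3D: (B, T, J, 3)

    def corrupt(pose2d, noise_std=0.02, mask_prob=0.15):
        """Simulate noisy, partial 2D observations: jitter and randomly mask joints."""
        noisy = pose2d + noise_std * torch.randn_like(pose2d)
        mask = (torch.rand(pose2d.shape[:3]) < mask_prob).unsqueeze(-1)
        return noisy.masked_fill(mask, 0.0)

    # One pretraining step on dummy data.
    model = TinyMotionEncoder()
    pose2d = torch.randn(8, 16, 17, 2)                   # (batch, frames, joints, xy)
    target3d = torch.randn(8, 16, 17, 3)                 # ground-truth 3D motion
    pred3d = model(corrupt(pose2d))
    loss = nn.functional.mse_loss(pred3d, target3d)      # 2D-to-3D lifting objective
    loss.backward()

Finetuning for a downstream task would, per the abstract, reuse the pretrained encoder and swap only the small (1-2 layer) regression head for a task-specific one.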