PGformer: Proxy-Bridged Game Transformer for Multi-Person Highly Interactive Extreme Motion Prediction

Multi-person motion prediction is a challenging task, especially for real-world scenarios of highly interacting persons. Most previous works have been devoted to studying the case of weak interactions (e.g., walking together), in which forecasting each human pose in isolation can typically still achieve good performance. This paper focuses on collaborative motion prediction for multiple persons with extreme motions and attempts to explore the relationships among the pose trajectories of highly interactive persons. Specifically, a novel cross-query attention (XQA) module tailored for this situation is proposed to bilaterally learn the cross-dependencies between the two pose sequences. A proxy unit is additionally introduced to bridge the involved persons; it cooperates with our proposed XQA module and subtly controls the bidirectional spatial information flows. These designs are then integrated into a Transformer-based architecture, and the resulting model is called the Proxy-Bridged Game Transformer (PGformer) for multi-person interactive motion prediction. Its effectiveness has been evaluated on the challenging ExPI dataset, which involves highly interactive actions. Our PGformer consistently outperforms the state-of-the-art methods in both short- and long-term predictions by a large margin. Besides, our approach is also compatible with the weakly interactive CMU-Mocap and MuPoTS-3D datasets and can be extended to the case of more than two individuals with encouraging results.
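To make the bilateral cross-dependency idea concrete, below is a minimal sketch of a generic bidirectional cross-attention block between two pose feature sequences. It is an illustrative assumption, not the paper's exact XQA or proxy-bridging formulation: the class name, dimensions, and the choice of two separate multi-head attention layers with residual connections are all hypothetical.

```python
# Hypothetical sketch: each person's sequence queries the other person's sequence,
# so both representations are updated with the partner's motion context.
# This is NOT the paper's exact XQA module; details are assumptions.
import torch
import torch.nn as nn


class BidirectionalCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # One attention layer per direction (A->B and B->A).
        self.attn_ab = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_ba = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor):
        # x_a, x_b: (batch, time, dim) feature sequences of the two persons.
        a_att, _ = self.attn_ab(query=x_a, key=x_b, value=x_b)
        b_att, _ = self.attn_ba(query=x_b, key=x_a, value=x_a)
        # Residual connections keep each person's own motion features.
        return x_a + a_att, x_b + b_att


# Usage example: two persons, 50 frames of 128-dim pose embeddings.
x_a = torch.randn(2, 50, 128)
x_b = torch.randn(2, 50, 128)
block = BidirectionalCrossAttention(dim=128)
y_a, y_b = block(x_a, x_b)
```

In the actual PGformer, a proxy unit additionally mediates this exchange to control the bidirectional information flow; the sketch above omits that component.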