Interactive Spatiotemporal Token Attention Network for Skeleton-based General Interactive Action Recognition

Recognizing interactive action plays an important role in human-robotinteraction and collaboration. Previous methods use late fusion andco-attention mechanism to capture interactive relations, which have limitedlearning capability or inefficiency to adapt to more interacting entities. Withassumption that priors of each entity are already known, they also lackevaluations on a more general setting addressing the diversity of subjects. Toaddress these problems, we propose an Interactive Spatiotemporal TokenAttention Network (ISTA-Net), which simultaneously model spatial, temporal, andinteractive relations. Specifically, our network contains a tokenizer topartition Interactive Spatiotemporal Tokens (ISTs), which is a unified way torepresent motions of multiple diverse entities. By extending the entitydimension, ISTs provide better interactive representations. To jointly learnalong three dimensions in ISTs, multi-head self-attention blocks integratedwith 3D convolutions are designed to capture inter-token correlations. Whenmodeling correlations, a strict entity ordering is usually irrelevant forrecognizing interactive actions. To this end, Entity Rearrangement is proposedto eliminate the orderliness in ISTs for interchangeable entities. Extensiveexperiments on four datasets verify the effectiveness of ISTA-Net byoutperforming state-of-the-art methods. Our code is publicly available athttps://github.com/Necolizer/ISTA-Net