Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition

Spatial-temporal graphs have been widely used by skeleton-based action recognition algorithms to model human action dynamics. To capture robust movement patterns from these graphs, long-range and multi-scale context aggregation and spatial-temporal dependency modeling are critical aspects of a powerful feature extractor. However, existing methods have limitations in achieving (1) unbiased long-range joint relationship modeling under multi-scale operators and (2) unobstructed cross-spacetime information flow for capturing complex spatial-temporal dependencies. In this work, we present (1) a simple method to disentangle multi-scale graph convolutions and (2) a unified spatial-temporal graph convolutional operator named G3D. The proposed multi-scale aggregation scheme disentangles the importance of nodes in different neighborhoods for effective long-range modeling. The proposed G3D module leverages dense cross-spacetime edges as skip connections for direct information propagation across the spatial-temporal graph. By coupling these proposals, we develop a powerful feature extractor named MS-G3D, based on which our model outperforms previous state-of-the-art methods on three large-scale datasets: NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400.
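To make the two ideas above concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes the "disentangled" multi-scale adjacency is defined by exact hop (shortest-path) distance, with one binary matrix per scale, and that the G3D-style spatial-temporal graph is obtained by tiling the skeleton adjacency (with self-loops) over a temporal window, so every joint gains direct edges to its spatial neighbors in every frame of the window. The function and parameter names (disentangled_multi_scale_adjacency, windowed_spacetime_adjacency, num_scales, window) are illustrative only.

import numpy as np

def disentangled_multi_scale_adjacency(A, num_scales):
    # One binary adjacency per scale k: A_k[i, j] = 1 iff the shortest path
    # between joints i and j has length exactly k (assumed definition).
    n = A.shape[0]
    dist = np.full((n, n), np.inf)
    np.fill_diagonal(dist, 0.0)
    walks = np.eye(n)
    for k in range(1, num_scales):
        walks = walks @ A                        # walks of length k
        newly_reached = (walks > 0) & np.isinf(dist)
        dist[newly_reached] = k                  # first reached at length k => distance k
    return [(dist == k).astype(np.float32) for k in range(num_scales)]

def windowed_spacetime_adjacency(A, window):
    # Tile the skeleton adjacency (with self-loops) over a temporal window so each
    # joint is directly connected to its spatial neighbors in every frame of the
    # window: dense cross-spacetime edges acting as skip connections.
    A_hat = A + np.eye(A.shape[0])
    return np.tile(A_hat, (window, window))      # shape: (window*N, window*N)

if __name__ == "__main__":
    # Toy 5-joint chain "skeleton": 0-1-2-3-4
    A = np.zeros((5, 5))
    for i in range(4):
        A[i, i + 1] = A[i + 1, i] = 1.0
    scales = disentangled_multi_scale_adjacency(A, num_scales=4)
    print([int(a.sum()) for a in scales])        # [5, 8, 6, 4]: entries per exact distance
    print(windowed_spacetime_adjacency(A, window=3).shape)   # (15, 15)

Keeping one adjacency matrix per exact distance, instead of summing powers of A, prevents nearby joints from dominating the aggregation weights at larger scales, which is the bias the disentanglement described above is meant to remove.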