Structure-Aware Human-Action Generation

Generating long-range skeleton-based human actions has been a challenging problem, since small deviations in one frame can cause a malformed action sequence. Most existing methods borrow ideas from video generation and naively treat skeleton nodes/joints as pixels of images, without considering the rich inter-frame and intra-frame structure information, leading to potentially distorted actions. Graph convolutional networks (GCNs) are a promising way to leverage structure information to learn structure representations. However, directly adopting GCNs to tackle such continuous action sequences in both the spatial and temporal spaces is challenging, as the action graph could be huge. To overcome this issue, we propose a variant of GCNs that leverages the powerful self-attention mechanism to adaptively sparsify a complete action graph in the temporal space. Our method dynamically attends to important past frames and constructs a sparse graph to apply in the GCN framework, capturing the structure information in action sequences well. Extensive experimental results demonstrate the superiority of our method over existing methods on two standard human action datasets.
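
To make the temporal sparsification idea concrete, below is a minimal PyTorch sketch, not the paper's implementation: the layer name SparseTemporalGCNLayer, the causal masking, the top-k edge selection, and all dimensions are illustrative assumptions. Each frame computes attention scores over itself and past frames, only the strongest k edges per frame are kept to form a sparse temporal adjacency, and one graph-convolution step propagates features over that sparse graph.

```python
# Illustrative sketch (assumed design, not the authors' code): self-attention
# scores over past frames are sparsified via top-k, then used as a temporal
# adjacency matrix for a single graph-convolution step.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseTemporalGCNLayer(nn.Module):
    """Hypothetical layer: attention-sparsified temporal graph + graph conv."""
    def __init__(self, in_dim, out_dim, top_k=4):
        super().__init__()
        self.query = nn.Linear(in_dim, in_dim)
        self.key = nn.Linear(in_dim, in_dim)
        self.proj = nn.Linear(in_dim, out_dim)
        self.top_k = top_k

    def forward(self, x):
        # x: (batch, T, in_dim) -- per-frame skeleton features
        B, T, D = x.shape
        q, k = self.query(x), self.key(x)
        scores = q @ k.transpose(1, 2) / D ** 0.5  # (B, T, T) frame affinities
        # Causal mask: a frame may only attend to itself and past frames.
        causal = torch.tril(torch.ones(T, T, device=x.device)).bool()
        scores = scores.masked_fill(~causal, float('-inf'))
        # Sparsify: keep only the top-k attention edges per frame.
        k_eff = min(self.top_k, T)
        topk_vals, topk_idx = scores.topk(k_eff, dim=-1)
        sparse = torch.full_like(scores, float('-inf'))
        sparse.scatter_(-1, topk_idx, topk_vals)
        adj = F.softmax(sparse, dim=-1)  # sparse, row-normalized adjacency
        # One graph-convolution step over the sparse temporal graph.
        return F.relu(self.proj(adj @ x))

# Usage: 32 sequences of 50 frames with 75-dim features
# (e.g. 25 joints x 3 coordinates per frame).
layer = SparseTemporalGCNLayer(in_dim=75, out_dim=128, top_k=4)
out = layer(torch.randn(32, 50, 75))  # -> (32, 50, 128)
```

In this sketch the softmax over the top-k scores doubles as edge-weight normalization, so frames outside the selected set contribute nothing to the aggregation; how the paper actually parameterizes the sparse graph may differ.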