Hybrid Dynamic-static Context-aware Attention Network for Action Assessment in Long Videos

The objective of action quality assessment is to score sports videos. However, most existing works focus only on video dynamic information (i.e., motion information) but ignore the specific postures that an athlete is performing in a video, which is important for action assessment in long videos. In this work, we present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos. To learn more discriminative representations for videos, we not only learn the video dynamic information but also focus on the static postures of the detected athletes in specific frames, which represent the action quality at certain moments, with the help of the proposed hybrid dynamic-static architecture. Moreover, we leverage a context-aware attention module, consisting of a temporal instance-wise graph convolutional network unit and an attention unit, for both streams to extract more robust stream features, where the former explores the relations between instances and the latter assigns a proper weight to each instance. Finally, we combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts. Additionally, we have collected and annotated the new Rhythmic Gymnastics dataset, which contains videos of four different types of gymnastics routines, for evaluation of action quality assessment in long videos. Extensive experimental results validate the efficacy of our proposed method, which outperforms related approaches. The code and dataset are available at \url{https://github.com/lingan1996/ACTION-NET}.
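To make the context-aware attention idea concrete, the following NumPy sketch illustrates the general pattern described above: a temporal graph-convolution step propagates information between neighboring instance features, an attention unit assigns each instance a weight, and the weighted pooled feature is regressed to a scalar score. This is a minimal illustration with hypothetical dimensions and randomly initialized weights, not the authors' implementation (see the linked repository for that).

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 8, 16                   # number of temporal instances, feature dimension
X = rng.normal(size=(T, D))    # instance-wise features from one stream (e.g. clip features)

# Temporal adjacency: each instance is linked to itself and its immediate neighbors,
# so the graph convolution can explore relations between adjacent instances.
A = np.eye(T) + np.eye(T, k=1) + np.eye(T, k=-1)
A = A / A.sum(axis=1, keepdims=True)          # row-normalize the adjacency

# One graph-convolution step: propagate features along temporal edges, then ReLU.
W_gcn = rng.normal(size=(D, D)) * 0.1
H = np.maximum(A @ X @ W_gcn, 0.0)

# Attention unit: score each instance and normalize with a softmax
# so that more informative instances receive larger weights.
w_att = rng.normal(size=(D,))
logits = H @ w_att
alpha = np.exp(logits - logits.max())
alpha /= alpha.sum()

# Attention-weighted pooling collapses the T instances into one stream feature.
f_stream = alpha @ H                          # shape (D,)

# A linear regression head maps the pooled feature to a quality score;
# in the full model the two stream features are combined before this step.
w_reg = rng.normal(size=(D,))
score = float(f_stream @ w_reg)
```

In the full two-stream model, this pooling runs once per stream (dynamic and static), and the two pooled features are fused before the final regression, supervised by the expert ground-truth scores.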