Improving Action Quality Assessment using Weighted Aggregation

Action quality assessment (AQA) aims at automatically judging a human action from a video of that action and assigning it a performance score. Most existing works on AQA divide RGB videos into short clips, transform these clips into higher-level representations using Convolutional 3D (C3D) networks, and aggregate the representations through averaging. These higher-level representations are then used to perform AQA. We find that the current clip-level feature aggregation technique of averaging is insufficient to capture the relative importance of clip-level features. In this work, we propose a learning-based weighted-averaging technique, which we call Weight-Decider (WD). Using this technique, better performance can be obtained without a large increase in computational cost. We also experiment with ResNets for learning better representations for action quality assessment, and we assess the effects of the depth and input clip size of the convolutional neural network on the quality of the predicted action scores. Using a 34-layer (2+1)D ResNet capable of processing 32-frame clips, together with WD aggregation, we achieve a new state-of-the-art Spearman's rank correlation of 0.9315 (an increase of 0.45%) on the MTL-AQA dataset.
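The core idea of replacing plain averaging with a learned weighted average of clip-level features can be sketched as follows. This is a minimal illustration, not the paper's exact Weight-Decider architecture: it assumes a hypothetical linear scorer (`w`) that assigns each clip a relevance score, which is softmax-normalized into aggregation weights.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def weighted_aggregate(clip_feats, w):
    # clip_feats: (num_clips, feat_dim) higher-level features, one row per clip
    # w: (feat_dim,) parameters of a hypothetical linear scorer standing in
    #    for the learned Weight-Decider module
    scores = clip_feats @ w        # one relevance score per clip
    alphas = softmax(scores)       # normalized clip weights, summing to 1
    return alphas @ clip_feats     # weighted average replaces the plain mean

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))   # e.g. 4 clips with 8-dim features
w = rng.standard_normal(8)
video_feat = weighted_aggregate(feats, w)
```

Note that with a zero scorer (all scores equal), the softmax weights become uniform and the scheme reduces to the ordinary clip-level average, so learned weighting strictly generalizes the baseline aggregation.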