
Abstract

Existing action quality assessment (AQA) methods mainly learn deep representations at the video level for scoring diverse actions. Lacking a fine-grained understanding of the actions in videos, they suffer from low credibility and interpretability, and are thus insufficient for stringent applications such as Olympic diving events. We argue that a fine-grained understanding of actions requires the model to perceive and parse actions in both time and space, which is also the key to the credibility and interpretability of the AQA technique. Based on this insight, we propose a new fine-grained spatial-temporal action parser named \textbf{FineParser}. It learns human-centric foreground action representations by focusing on target action regions within each frame and exploiting their fine-grained alignments in time and space to minimize the impact of invalid backgrounds during the assessment. In addition, we construct fine-grained annotations of human-centric foreground action masks for the FineDiving dataset, called \textbf{FineDiving-HM}. With refined annotations on diverse target action procedures, FineDiving-HM can promote the development of real-world AQA systems. Through extensive experiments, we demonstrate the effectiveness of FineParser, which outperforms state-of-the-art methods while supporting more tasks of fine-grained action understanding. Data and code are available at \url{https://github.com/PKU-ICST-MIPL/FineParser_CVPR2024}.
