Temporally-Aware Feature Pooling for Action Spotting in Soccer Broadcasts

Toward the goal of automatic production for sports broadcasts, a paramount task consists in understanding the high-level semantic information of the game in play. For instance, recognizing and localizing the main actions of the game would allow producers to adapt and automate the broadcast production, focusing on the important details of the game and maximizing spectator engagement. In this paper, we focus our analysis on action spotting in soccer broadcasts, which consists in temporally localizing the main actions in a soccer game. To that end, we propose a novel feature pooling method based on NetVLAD, dubbed NetVLAD++, that embeds temporally-aware knowledge. Unlike previous pooling methods that consider the temporal context as a single set to pool from, we split the context into the frames before and after an action occurs. We argue that considering the contextual information around the action spot as a single entity leads to sub-optimal learning for the pooling module. With NetVLAD++, we disentangle the context from the past and future frames and learn a specific vocabulary of semantics for each subset, which avoids blending and blurring these vocabularies in time. Injecting such prior knowledge creates more informative pooling modules and more discriminative pooled features, leading to a better understanding of the actions. We train and evaluate our methodology on the recent large-scale dataset SoccerNet-v2, reaching 53.4% Average-mAP for action spotting, a +12.7% improvement w.r.t. the current state-of-the-art.
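
The pooling idea described above can be illustrated with a minimal NumPy sketch. It is not the reference NetVLAD++ implementation: the names `netvlad_pool`, `temporally_aware_pool` and `init_params` are hypothetical, the window is assumed to be split at its center frame, the two pooled vectors are assumed to be concatenated, and the parameters are random rather than learned.

```python
import numpy as np


def netvlad_pool(features, centers, weights, biases, eps=1e-12):
    """NetVLAD-style pooling of a (T, D) set of frame features into a (K*D,) vector."""
    # Soft-assign every frame to the K clusters with a numerically stable softmax.
    logits = features @ weights + biases                      # (T, K)
    logits -= logits.max(axis=1, keepdims=True)
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)               # (T, K)
    # Accumulate weighted residuals: V_k = sum_t a_tk * (x_t - c_k).
    vlad = assign.T @ features - assign.sum(axis=0)[:, None] * centers  # (K, D)
    # Intra-normalize each cluster, then L2-normalize the flattened vector.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + eps
    vlad = vlad.ravel()
    return vlad / (np.linalg.norm(vlad) + eps)


def temporally_aware_pool(window, past_params, future_params):
    """Pool the frames before and after the window center with separate vocabularies."""
    half = window.shape[0] // 2  # assumption: the action spot is the center frame
    past = netvlad_pool(window[:half], *past_params)
    future = netvlad_pool(window[half:], *future_params)
    return np.concatenate([past, future])


def init_params(rng, dim, num_clusters):
    """Random stand-ins for the learned cluster centers and assignment layer."""
    centers = rng.standard_normal((num_clusters, dim))
    weights = rng.standard_normal((dim, num_clusters))
    biases = np.zeros(num_clusters)
    return centers, weights, biases


rng = np.random.default_rng(0)
T, D, K = 30, 16, 4  # frames in the window, feature size, clusters per vocabulary
window = rng.standard_normal((T, D))
past_params = init_params(rng, D, K)
future_params = init_params(rng, D, K)

pooled = temporally_aware_pool(window, past_params, future_params)
print(pooled.shape)  # (128,) = 2 vocabularies * K clusters * D dimensions
```

Within each half the pooling is still order-invariant, so the only temporal prior injected in this sketch is whether a frame falls before or after the assumed action spot.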