A Context-Aware Loss Function for Action Spotting in Soccer Videos

In video understanding, action spotting consists in temporally localizing human-induced events annotated with single timestamps. In this paper, we propose a novel loss function that specifically considers the temporal context naturally present around each action, rather than focusing on the single annotated frame to spot. We benchmark our loss on a large dataset of soccer videos, SoccerNet, and achieve an improvement of 12.8% over the baseline. We show the generalization capability of our loss for generic activity proposals and detection on ActivityNet, by spotting the beginning and the end of each activity. Furthermore, we provide an extended ablation study and display challenging cases for action spotting in soccer videos. Finally, we qualitatively illustrate how our loss induces a precise temporal understanding of actions and show how such semantic knowledge can be used for automatic highlights generation.
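To make the core idea concrete, the sketch below shows one simple way a loss can exploit the temporal context around single-timestamp annotations: per-frame classification errors are down-weighted for frames close to an annotated action and fully penalized far from it. This is an illustrative simplification, not the paper's exact formulation; the function name `context_aware_spotting_loss`, the linear weighting scheme, and the `context_radius` parameter are assumptions introduced for the example.

```python
# Minimal sketch (assumed simplification, not the paper's exact loss):
# weight each frame's binary cross-entropy by its temporal distance to the
# nearest annotated timestamp, so near-miss predictions in the context
# around an action are penalized less than predictions far from any action.

import torch
import torch.nn.functional as F


def context_aware_spotting_loss(scores, annotation_frames, context_radius=20):
    """Weighted BCE over per-frame action scores.

    scores:            (T,) raw logits, one per video frame
    annotation_frames: list of annotated frame indices (single timestamps)
    context_radius:    number of frames around an annotation treated as context
    """
    T = scores.shape[0]
    device = scores.device

    # Targets are 1 only at the annotated frames.
    targets = torch.zeros(T, device=device)
    frame_idx = torch.arange(T, device=device, dtype=torch.float)

    # Distance from every frame to its nearest annotation.
    if annotation_frames:
        ann = torch.tensor(annotation_frames, device=device, dtype=torch.float)
        targets[ann.long()] = 1.0
        dist = (frame_idx[:, None] - ann[None, :]).abs().min(dim=1).values
    else:
        dist = torch.full((T,), float(context_radius), device=device)

    # Inside the context window the weight grows linearly from 0 (at the
    # annotation) to 1 (at the window border); outside it stays at 1.
    # Annotated frames themselves keep full weight.
    weights = torch.clamp(dist / context_radius, max=1.0)
    weights = torch.where(targets > 0, torch.ones_like(weights), weights)

    bce = F.binary_cross_entropy_with_logits(scores, targets, reduction="none")
    return (weights * bce).mean()


if __name__ == "__main__":
    logits = torch.randn(100, requires_grad=True)  # dummy per-frame scores
    loss = context_aware_spotting_loss(logits, annotation_frames=[30, 75])
    loss.backward()
    print(float(loss))
```

The design choice illustrated here is the key intuition stated in the abstract: frames just before or just after an action are not treated as hard negatives, which keeps the model from being punished for temporally near-correct predictions.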