Event detection in coarsely annotated sports videos via parallel multi receptive field 1D convolutions

In problems such as sports video analytics, it is difficult to obtain accurate frame-level annotations and exact event durations because of the lengthy videos and the sheer volume of video data. This issue is even more pronounced in fast-paced sports such as ice hockey. Obtaining annotations on a coarse scale can be much more practical and time-efficient. We propose the task of event detection in coarsely annotated videos. We introduce a multi-tower temporal convolutional network architecture for the proposed task. The network, with the help of multiple receptive fields, processes information at various temporal scales to account for the uncertainty with regard to the exact event location and duration. We demonstrate the effectiveness of the multi-receptive-field architecture through appropriate ablation studies. The method is evaluated on two tasks: event detection in coarsely annotated hockey videos in the NHL dataset and event spotting in soccer on the SoccerNet dataset. The two datasets lack frame-level annotations and have very distinct event frequencies. Experimental results demonstrate the effectiveness of the network by obtaining a 55% average F1 score on the NHL dataset and by achieving competitive performance compared to the state of the art on the SoccerNet dataset. We believe our approach will help develop more practical pipelines for event detection in sports video.
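The idea of parallel towers with different receptive fields can be illustrated with a minimal sketch. This is not the architecture from the paper: it assumes a single-channel per-frame signal, uses fixed averaging kernels as stand-ins for learned weights, and the kernel sizes (3, 5, 9) and the function names `conv1d_same` and `multi_tower` are hypothetical choices made for illustration only.

```python
from typing import List, Sequence


def conv1d_same(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """1D cross-correlation with zero padding; output length equals input length."""
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("this sketch expects an odd kernel size")
    half = k // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + j - half
            # Positions outside the signal are treated as zeros (zero padding).
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def multi_tower(signal: Sequence[float],
                kernel_sizes: Sequence[int] = (3, 5, 9)) -> List[List[float]]:
    """Run one tower per kernel size in parallel and stack the outputs per time step.

    Assumption: each tower is a single averaging filter; a trained network
    would learn the weights and typically stack several layers per tower.
    """
    towers = [conv1d_same(signal, [1.0 / k] * k) for k in kernel_sizes]
    # Concatenate along the channel axis: one feature vector per time step.
    return [list(step) for step in zip(*towers)]


# A single impulse at frame 4 stands in for a coarsely located event.
signal = [0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0]
features = multi_tower(signal)
```

In this toy setting, the narrow tower responds only near frame 4, while the widest tower responds at every frame of the clip. That is the intuition behind combining several temporal scales when the exact event location and duration are uncertain.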