Spotting Temporally Precise, Fine-Grained Events in Video

We introduce the task of spotting temporally precise, fine-grained events in video (detecting the precise moment in time events occur). Precise spotting requires models to reason globally about the full time scale of actions and locally to identify subtle frame-to-frame appearance and motion differences that identify events during these actions. Surprisingly, we find that top-performing solutions to prior video understanding tasks such as action detection and segmentation do not simultaneously meet both requirements. In response, we propose E2E-Spot, a compact, end-to-end model that performs well on the precise spotting task and can be trained quickly on a single GPU. We demonstrate that E2E-Spot significantly outperforms recent baselines adapted from the video action detection, segmentation, and spotting literature to the precise spotting task. Finally, we contribute new annotations and splits to several fine-grained sports action datasets to make these datasets suitable for future work on precise spotting.