Few-Shot Temporal Action Localization with Query Adaptive Transformer

Existing temporal action localization (TAL) works rely on a large number of training videos with exhaustive segment-level annotation, preventing them from scaling to new classes. As a solution to this problem, few-shot TAL (FS-TAL) aims to adapt a model to a new class represented by as few as a single video. Existing FS-TAL methods assume trimmed training videos for new classes. However, this setting is not only unnatural, since actions are typically captured in untrimmed videos, but also ignores background video segments containing vital contextual cues for foreground action segmentation. In this work, we first propose a new FS-TAL setting that uses untrimmed training videos. Further, a novel FS-TAL model is proposed which maximizes the knowledge transfer from training classes whilst enabling the model to be dynamically adapted to both the new class and each video of that class simultaneously. This is achieved by introducing a query adaptive Transformer into the model. Extensive experiments on two action localization benchmarks demonstrate that our method significantly outperforms all state-of-the-art alternatives in both single-domain and cross-domain scenarios. The source code can be found at https://github.com/sauradip/fewshotQAT
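
To make the query-adaptation idea concrete, below is a minimal PyTorch sketch of how a Transformer block could condition untrimmed query-video features on a few-shot support video of the new class. The module name, dimensions, and layer layout here are illustrative assumptions, not the paper's actual architecture; refer to the linked repository for the real implementation.

```python
import torch
import torch.nn as nn


class QueryAdaptiveBlock(nn.Module):
    """Hypothetical sketch of a query-adaptive Transformer block.

    Cross-attention lets snippet features of an untrimmed query video
    attend to features of the few-shot support video(s), so the model
    adapts to both the new class (via the support set) and the specific
    query video (via its own snippets) at the same time.
    """

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor, support_feats: torch.Tensor):
        # query_feats:   (B, Tq, D) snippet features of the untrimmed query video
        # support_feats: (B, Ts, D) snippet features of the support video(s)
        x = query_feats
        # Adapt query snippets to the new class via cross-attention to support.
        x = self.norm1(x + self.cross_attn(x, support_feats, support_feats)[0])
        # Model temporal context within the query video itself.
        x = self.norm2(x + self.self_attn(x, x, x)[0])
        return self.norm3(x + self.ffn(x))


if __name__ == "__main__":
    block = QueryAdaptiveBlock()
    q = torch.randn(2, 100, 256)  # untrimmed query video, 100 snippets
    s = torch.randn(2, 50, 256)   # one-shot support video, 50 snippets
    print(block(q, s).shape)      # torch.Size([2, 100, 256])
```

The adapted snippet features could then feed a per-snippet foreground/background head for localization; because adaptation happens in the forward pass, no gradient-based fine-tuning on the new class is required.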