8 months ago

Abstract

Developing end-to-end action recognition models on long videos is fundamental and crucial for long-video action understanding. Due to the unaffordable cost of end-to-end training on the whole long videos, existing works generally train models on short clips trimmed from long videos. However, this "trimming-then-training" practice requires action interval annotations for clip-level supervision, i.e., knowing which actions are trimmed into the clips. Unfortunately, collecting such annotations is very expensive and prevents model training at scale. To this end, this work aims to build a weakly supervised end-to-end framework for training recognition models on long videos, with only video-level action category labels. Without knowing the precise temporal locations of actions in long videos, our proposed weakly supervised framework, namely AdaptFocus, estimates where and how likely the actions will occur to adaptively focus on informative action clips for end-to-end training. The effectiveness of the proposed AdaptFocus framework is demonstrated on three long-video datasets. Furthermore, for downstream long-video tasks, our AdaptFocus framework provides a weakly supervised feature extraction pipeline for extracting more robust long-video features, such that the state-of-the-art methods on downstream tasks are significantly advanced. We will release the code and models.
