
Abstract

Current fully-supervised video datasets consist of only a few hundred thousand videos and fewer than a thousand domain-specific labels. This hinders progress towards advanced video architectures. This paper presents an in-depth study of using large volumes of web videos for pre-training video models for the task of action recognition. Our primary empirical finding is that pre-training at a very large scale (over 65 million videos), despite relying on noisy social-media videos and hashtags, substantially improves the state of the art on three challenging public action recognition datasets. Further, we examine three questions in the construction of weakly-supervised video action datasets. First, given that actions involve interactions with objects, how should one construct a verb-object pre-training label space to benefit transfer learning the most? Second, frame-based models perform quite well on action recognition; is pre-training for good image features sufficient, or is pre-training for spatio-temporal features valuable for optimal transfer learning? Finally, actions are generally less well-localized in long videos vs. short videos; since action labels are provided at a video level, how should one choose video clips for best performance, given some fixed budget of number or minutes of videos?
