Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification

Video classification research has recently focused on two areas: temporal modeling and efficient 3D architectures. However, temporal modeling methods tend to be inefficient, and efficient 3D architectures pay less attention to temporal modeling. To bridge the gap between them, we propose an efficient temporal-modeling 3D architecture, called VoV3D, that consists of a temporal one-shot aggregation (T-OSA) module and a depthwise factorized component, D(2+1)D. T-OSA is devised to build a feature hierarchy by aggregating temporal features with different temporal receptive fields. Stacking T-OSA modules enables the network itself to model short-range as well as long-range temporal relationships across frames without any external modules. Inspired by kernel factorization and channel factorization, we also design a depthwise spatiotemporal factorization module, named D(2+1)D, that decomposes a 3D depthwise convolution into spatial and temporal depthwise convolutions, making our network more lightweight and efficient. Using the proposed temporal modeling method (T-OSA) and the efficient factorized component (D(2+1)D), we construct two types of VoV3D networks, VoV3D-M and VoV3D-L. Thanks to the efficiency and effectiveness of its temporal modeling, VoV3D-L has 6x fewer model parameters and requires 16x less computation while surpassing a state-of-the-art temporal modeling method on both Something-Something and Kinetics-400. Furthermore, VoV3D shows better temporal modeling ability than X3D, a state-of-the-art efficient 3D architecture with comparable model capacity. We hope that VoV3D can serve as a baseline for efficient video classification.
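To make the T-OSA idea concrete, below is a minimal PyTorch sketch of one-shot temporal aggregation: a stack of depthwise temporal convolutions whose receptive field grows layer by layer, with all intermediate features concatenated once and fused by a 1x1x1 convolution. The class name, layer count, and kernel size here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TOSA(nn.Module):
    """Sketch of temporal one-shot aggregation (hypothetical sizes).

    Each successive temporal depthwise conv enlarges the temporal
    receptive field; the input and every intermediate feature are
    concatenated once and aggregated by a 1x1x1 convolution.
    """

    def __init__(self, channels, num_layers=3, temporal_kernel=3):
        super().__init__()
        pad = temporal_kernel // 2
        self.layers = nn.ModuleList(
            nn.Conv3d(channels, channels,
                      kernel_size=(temporal_kernel, 1, 1),
                      padding=(pad, 0, 0),
                      groups=channels, bias=False)  # depthwise temporal conv
            for _ in range(num_layers)
        )
        # fuse input + all intermediate features in one shot
        self.aggregate = nn.Conv3d(channels * (num_layers + 1), channels,
                                   kernel_size=1, bias=False)

    def forward(self, x):  # x: (N, C, T, H, W)
        feats = [x]
        out = x
        for layer in self.layers:
            out = layer(out)  # temporal receptive field grows per layer
            feats.append(out)
        return self.aggregate(torch.cat(feats, dim=1))
```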
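Similarly, the D(2+1)D component can be sketched as the factorization the abstract describes: a kxkxk 3D depthwise convolution decomposed into a 1xkxk spatial depthwise convolution followed by a kx1x1 temporal depthwise convolution. The class name and default kernel size are assumptions for illustration.

```python
import torch.nn as nn

class D2Plus1D(nn.Module):
    """Sketch of depthwise spatiotemporal factorization: a kxkxk 3D
    depthwise conv split into spatial and temporal depthwise convs."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # 1 x k x k spatial depthwise convolution
        self.spatial = nn.Conv3d(channels, channels,
                                 kernel_size=(1, kernel_size, kernel_size),
                                 padding=(0, pad, pad),
                                 groups=channels, bias=False)
        # k x 1 x 1 temporal depthwise convolution
        self.temporal = nn.Conv3d(channels, channels,
                                  kernel_size=(kernel_size, 1, 1),
                                  padding=(pad, 0, 0),
                                  groups=channels, bias=False)

    def forward(self, x):  # x: (N, C, T, H, W)
        return self.temporal(self.spatial(x))
```

Relative to a full kxkxk depthwise kernel with k^3 weights per channel, this factorized form needs only k^2 + k weights per channel, which is the source of the lightweight design.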