ActionVLAD: Learning spatio-temporal aggregation for action classification

In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks with learnable spatio-temporal feature aggregation. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different strategies for pooling across space and time and for combining signals from the different streams. We find that: (i) it is important to pool jointly across space and time, but (ii) appearance and motion streams are best aggregated into their own separate representations. Finally, we show that our representation outperforms the two-stream base architecture by a large margin (13% relative) and also outperforms other baselines with comparable base architectures on the HMDB51, UCF101, and Charades video classification benchmarks.
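To make the aggregation step concrete, the sketch below shows a NetVLAD-style trainable pooling layer applied jointly over the spatial and temporal positions of convolutional feature maps, in the spirit of the abstract's "learnable spatio-temporal feature aggregation". It is a minimal PyTorch illustration under assumed names and hyperparameters (feature_dim, num_clusters, the SpatioTemporalVLAD class), not the authors' implementation; in a two-stream setup, one such module would pool the appearance stream and a separate one the motion stream.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalVLAD(nn.Module):
    """Soft-assignment VLAD pooling over all space-time positions (sketch)."""

    def __init__(self, feature_dim=512, num_clusters=64):
        super().__init__()
        self.num_clusters = num_clusters
        # Learnable cluster centers ("action words")
        self.centers = nn.Parameter(0.01 * torch.randn(num_clusters, feature_dim))
        # 1x1 convolution producing per-descriptor soft-assignment logits
        self.assign = nn.Conv2d(feature_dim, num_clusters, kernel_size=1)

    def forward(self, x):
        # x: (batch, time, channels, height, width) conv feature maps
        b, t, c, h, w = x.shape
        x = x.reshape(b * t, c, h, w)

        # Soft-assign every spatio-temporal descriptor to the clusters
        soft_assign = F.softmax(self.assign(x), dim=1)              # (b*t, K, h, w)
        soft_assign = soft_assign.view(b, t, self.num_clusters, h * w)
        soft_assign = soft_assign.permute(0, 1, 3, 2).reshape(b, t * h * w, self.num_clusters)

        # Flatten descriptors so space and time are pooled jointly
        descriptors = x.view(b, t, c, h * w).permute(0, 1, 3, 2).reshape(b, t * h * w, c)

        # Aggregate residuals to each center, weighted by the soft assignments
        residuals = descriptors.unsqueeze(2) - self.centers          # (b, N, K, c)
        vlad = (soft_assign.unsqueeze(-1) * residuals).sum(dim=1)    # (b, K, c)

        # Intra-normalize per cluster, then L2-normalize the flattened vector
        vlad = F.normalize(vlad, p=2, dim=2).flatten(1)              # (b, K*c)
        return F.normalize(vlad, p=2, dim=1)


# Illustrative usage: pool features from 5 frames of a single clip
features = torch.randn(2, 5, 512, 7, 7)          # (batch, time, C, H, W)
video_descriptor = SpatioTemporalVLAD()(features)
print(video_descriptor.shape)                     # torch.Size([2, 32768])
```

The key design point reflected here is that the softmax assignment and the cluster centers are ordinary differentiable parameters, so the pooling can be trained end-to-end with the rest of the network for whole-video classification.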