Group Contextualization for Video Recognition

Learning discriminative representation from the complex spatio-temporal dynamic space is essential for video recognition. On top of those stylized spatio-temporal computational units, further refining the learnt feature with axial contexts has been demonstrated to be promising in achieving this goal. However, previous works generally focus on utilizing a single kind of context to calibrate entire feature channels and can hardly handle diverse video activities. The problem can be tackled by using pair-wise spatio-temporal attentions to recompute the feature response with cross-axis contexts, but at the expense of heavy computation. In this paper, we propose an efficient feature refinement method that decomposes the feature channels into several groups and separately refines them with different axial contexts in parallel. We refer to this lightweight feature calibration as group contextualization (GC). Specifically, we design a family of efficient element-wise calibrators, i.e., ECal-G/S/T/L, whose axial contexts are information dynamics aggregated from other axes either globally or locally, to contextualize feature channel groups. The GC module can be densely plugged into each residual layer of off-the-shelf video networks. With little computational overhead, consistent improvements are observed when GC is plugged into different networks. Since the calibrators embed features with four different kinds of contexts in parallel, the learnt representation is expected to be more resilient to diverse types of activities. On videos with rich temporal variations, GC empirically boosts the performance of 2D-CNNs (e.g., TSN and TSM) to a level comparable to state-of-the-art video networks. Code is available at https://github.com/haoyanbin918/Group-Contextualization.
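
The sketch below illustrates the group contextualization idea described in the abstract: channels are split into four groups and each group is recalibrated with a different axial context in parallel. It is a minimal PyTorch illustration, not the authors' implementation; the internals of the ECal-G/S/T/L calibrators and the specific pooling axes chosen here are assumptions made for clarity, and the official repository should be consulted for the actual designs.

```python
import torch
import torch.nn as nn


class ECal(nn.Module):
    """Illustrative element-wise calibrator: aggregate context along the given
    axes, project it with a 1x1x1 convolution, and gate the input features."""

    def __init__(self, channels, pool_dims=None):
        super().__init__()
        self.pool_dims = pool_dims  # None -> keep a local, per-position context
        self.proj = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, x):  # x: (N, C_g, T, H, W)
        ctx = x if self.pool_dims is None else x.mean(self.pool_dims, keepdim=True)
        return x * torch.sigmoid(self.proj(ctx))  # element-wise recalibration


class GroupContextualization(nn.Module):
    """Decompose channels into four groups and refine each with a different
    (assumed) axial context in parallel, then concatenate the results."""

    def __init__(self, channels):
        super().__init__()
        assert channels % 4 == 0, "channels must split evenly into 4 groups"
        c = channels // 4
        self.calibrators = nn.ModuleList([
            ECal(c, pool_dims=(2, 3, 4)),  # global context (ECal-G-like)
            ECal(c, pool_dims=(2,)),       # pool over time, keep spatial layout (ECal-S-like)
            ECal(c, pool_dims=(3, 4)),     # pool over space, keep temporal axis (ECal-T-like)
            ECal(c, pool_dims=None),       # local per-position context (ECal-L-like)
        ])

    def forward(self, x):  # x: (N, C, T, H, W)
        groups = torch.chunk(x, 4, dim=1)
        out = [cal(g) for cal, g in zip(self.calibrators, groups)]
        return torch.cat(out, dim=1)  # same shape as x, ready for the residual layer


if __name__ == "__main__":
    feat = torch.randn(2, 64, 8, 14, 14)  # (batch, channels, frames, height, width)
    gc = GroupContextualization(64)
    print(gc(feat).shape)  # torch.Size([2, 64, 8, 14, 14])
```

Because the module preserves the input shape and adds only lightweight gating operations, it can in principle be inserted after each residual layer of a 2D or 3D video backbone, consistent with the dense plug-in usage described above.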