Learning Correlation Structures for Vision Transformers

We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention. StructSA generates attention maps by recognizing space-time structures of key-query correlations via convolution and uses them to dynamically aggregate local contexts of value features. This effectively exploits rich structural patterns in images and videos, such as scene layouts, object motion, and inter-object relations. Using StructSA as a main building block, we develop the structural vision transformer (StructViT) and evaluate its effectiveness on both image and video classification tasks, achieving state-of-the-art results on ImageNet-1K, Kinetics-400, Something-Something V1 & V2, Diving-48, and FineGym.
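To make the core idea concrete, the sketch below illustrates one plausible reading of the mechanism in PyTorch: query-key correlation maps are treated as small spatial images, a convolution detects local structures in them, and the resulting weights aggregate value features. This is a minimal, hedged sketch, not the paper's exact StructSA; the module name, kernel size, head count, and the simplification of the value aggregation (a plain attention-weighted sum rather than dynamic local-context aggregation) are all assumptions.

```python
import torch
import torch.nn as nn

class StructuralAttentionSketch(nn.Module):
    """Illustrative sketch (NOT the paper's exact StructSA).
    1) compute query-key correlation maps,
    2) convolve each correlation map to detect local spatial structures,
    3) use the resulting weights to aggregate value features.
    Shapes, kernel size, and names are assumptions for illustration.
    """
    def __init__(self, dim, heads=4, kernel_size=3):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # depthwise conv over each query's h x w correlation map
        self.struct_conv = nn.Conv2d(heads, heads, kernel_size,
                                     padding=kernel_size // 2, groups=heads)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, h, w):
        # x: (B, N, C) with N = h * w spatial tokens
        B, N, C = x.shape
        q, k, v = (t.reshape(B, N, self.heads, C // self.heads).transpose(1, 2)
                   for t in self.qkv(x).chunk(3, dim=-1))   # (B, heads, N, C/heads)
        corr = (q @ k.transpose(-2, -1)) * self.scale        # (B, heads, Nq, Nk)
        # fold queries into the batch and view keys as an h x w map
        corr = corr.permute(0, 2, 1, 3).reshape(B * N, self.heads, h, w)
        corr = self.struct_conv(corr)                        # recognize local structures
        corr = corr.reshape(B, N, self.heads, N).permute(0, 2, 1, 3)
        attn = corr.softmax(dim=-1)                          # (B, heads, Nq, Nk)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)    # aggregate values
        return self.proj(out)

# toy usage: an 8x8 token grid with 64 channels
x = torch.randn(2, 64, 64)                  # (batch, tokens, channels)
layer = StructuralAttentionSketch(dim=64, heads=4)
y = layer(x, h=8, w=8)                      # (2, 64, 64)
```

Extending the same idea to video would replace the 2D convolution with a 3D one over space-time correlation volumes, which is how the abstract's "space-time structures" could be handled; that extension is likewise an assumption here.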