Scalable Vision Transformers with Hierarchical Pooling

The recently proposed Vision Transformers (ViT) with pure attention have achieved promising performance on image recognition tasks, such as image classification. However, current ViT models maintain a full-length patch sequence during inference, which is redundant and lacks hierarchical representation. To this end, we propose a Hierarchical Visual Transformer (HVT) that progressively pools visual tokens to shrink the sequence length and hence reduce the computational cost, analogous to feature-map downsampling in Convolutional Neural Networks (CNNs). This brings a notable benefit: we can increase model capacity by scaling the depth, width, resolution, or patch size without introducing extra computational complexity, thanks to the reduced sequence length. Moreover, we empirically find that the average-pooled visual tokens contain more discriminative information than the single class token. To demonstrate the improved scalability of HVT, we conduct extensive experiments on the image classification task. With comparable FLOPs, HVT outperforms competitive baselines on the ImageNet and CIFAR-100 datasets. Code is available at https://github.com/MonashAI/HVT
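
The two mechanisms the abstract describes, pooling the patch-token sequence between transformer stages and classifying from the average-pooled tokens rather than a class token, can be illustrated with a minimal PyTorch sketch. Note that this is not the authors' implementation: the stage layout, pooling kernel, and dimensions below are illustrative assumptions, and the released code at the repository above is the authoritative reference.

```python
# Minimal sketch of hierarchical token pooling in a ViT-style model.
# Assumed details (not from the paper): 3 stages of 4 blocks each,
# max-pooling with stride 2 between stages, ViT-Small-like dimensions.
import torch
import torch.nn as nn

class HierarchicalPoolingViT(nn.Module):
    def __init__(self, dim=384, num_heads=6, blocks_per_stage=(4, 4, 4),
                 num_classes=1000, num_patches=196):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        make_block = lambda: nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.stages = nn.ModuleList(
            nn.ModuleList(make_block() for _ in range(n))
            for n in blocks_per_stage)
        # Pooling between stages halves the token sequence length,
        # analogous to feature-map downsampling in CNNs.
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):  # tokens: (B, N, dim) patch embeddings
        x = tokens + self.pos_embed[:, : tokens.size(1)]
        for i, stage in enumerate(self.stages):
            for block in stage:
                x = block(x)
            if i < len(self.stages) - 1:
                # Shrink the sequence: MaxPool1d expects (B, dim, N).
                x = self.pool(x.transpose(1, 2)).transpose(1, 2)
        # Average-pool all remaining tokens instead of using a class token.
        return self.head(x.mean(dim=1))

model = HierarchicalPoolingViT()
patches = torch.randn(2, 196, 384)  # e.g. a pre-embedded 14x14 patch grid
logits = model(patches)             # shape: (2, 1000)
```

Because each pooling step halves the number of tokens, the attention cost of later stages drops sharply, which is what frees the budget for scaling depth, width, resolution, or patch size at comparable total FLOPs.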