Patch Slimming for Efficient Vision Transformers

This paper studies the efficiency problem for visual transformers by excavating redundant calculation in given networks. The recent transformer architecture has demonstrated its effectiveness for achieving excellent performance on a series of computer vision tasks. However, similar to convolutional neural networks, the huge computational cost of vision transformers is still a severe issue. Considering that the attention mechanism aggregates different patches layer-by-layer, we present a novel patch slimming approach that discards useless patches in a top-down paradigm. We first identify the effective patches in the last layer and then use them to guide the patch selection process of previous layers. For each layer, the impact of a patch on the final output feature is approximated, and patches with less impact are removed. Experimental results on benchmark datasets demonstrate that the proposed method can significantly reduce the computational cost of vision transformers without affecting their performance. For example, over 45% of the FLOPs of the ViT-Ti model can be reduced with only a 0.2% top-1 accuracy drop on the ImageNet dataset.
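
To make the top-down selection idea concrete, the following is a minimal sketch, not the paper's actual algorithm: it assumes per-layer attention maps are already available, uses the class-token attention in the last layer as a stand-in for the "effective patch" criterion, and propagates importance to earlier layers through the attention of the patches kept in the next layer. The function name, the fixed keep_ratio, and the scoring rule are all illustrative assumptions.

```python
import torch


def top_down_patch_selection(attn_maps, keep_ratio=0.55):
    """Select patches layer by layer, starting from the last layer.

    attn_maps:  list of [num_tokens, num_tokens] attention matrices,
                one per transformer layer (hypothetical input format,
                with the class token assumed to be at index 0).
    keep_ratio: fraction of tokens retained per layer (illustrative).
    Returns a list of boolean keep-masks, one per layer.
    """
    num_layers = len(attn_maps)
    num_tokens = attn_maps[0].shape[0]
    num_keep = max(1, int(keep_ratio * num_tokens))

    masks = [None] * num_layers

    # Last layer: score each token by how strongly the class token
    # attends to it, and keep the top-scoring tokens.
    score = attn_maps[-1][0]
    masks[-1] = torch.zeros(num_tokens, dtype=torch.bool)
    masks[-1][score.topk(num_keep).indices] = True

    # Earlier layers: a token is considered important if the tokens
    # kept in the next layer attend to it strongly -- a crude proxy
    # for the impact-on-final-output approximation in the paper.
    for layer in range(num_layers - 2, -1, -1):
        kept_next = masks[layer + 1]
        score = attn_maps[layer][kept_next].sum(dim=0)
        masks[layer] = torch.zeros(num_tokens, dtype=torch.bool)
        masks[layer][score.topk(num_keep).indices] = True

    return masks


# Usage example with random attention maps for a 12-layer ViT
# operating on 196 patch tokens plus one class token.
maps = [torch.rand(197, 197).softmax(dim=-1) for _ in range(12)]
masks = top_down_patch_selection(maps, keep_ratio=0.55)
```

The key design point the sketch tries to reflect is the top-down dependency: the mask of layer l is computed only after the mask of layer l+1, so a patch survives in a shallow layer only if it contributes to patches that matter deeper in the network.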