
Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, Pengtao Xie
Abstract

Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head self-attention (MHSA) among them. Fully leveraging these image tokens brings redundant computations, since not all the tokens are attentive in MHSA. For example, tokens containing semantically meaningless or distractive image backgrounds do not positively contribute to the ViT predictions. In this work, we propose to reorganize image tokens during the feed-forward process of ViT models, which is integrated into ViT during training. For each forward inference, we identify the attentive image tokens between the MHSA and FFN (i.e., feed-forward network) modules, guided by the corresponding class token attention. Then, we reorganize image tokens by preserving attentive image tokens and fusing inattentive ones to expedite subsequent MHSA and FFN computations. To this end, our method EViT improves ViTs from two perspectives. First, under the same amount of input image tokens, our method reduces MHSA and FFN computation for efficient inference. For instance, the inference speed of DeiT-S is increased by 50% while its recognition accuracy decreases by only 0.3% on ImageNet classification. Second, at the same computational cost, our method empowers ViTs to take more image tokens as input to improve recognition accuracy, where the image tokens come from higher-resolution images. For example, we improve the recognition accuracy of DeiT-S by 1% on ImageNet classification at the same computational cost as a vanilla DeiT-S. Meanwhile, our method does not introduce additional parameters to ViTs. Experiments on standard benchmarks show the effectiveness of our method. The code is available at https://github.com/youweiliang/evit
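To make the token reorganization step concrete, below is a minimal PyTorch sketch of the idea described in the abstract: image tokens are ranked by the attention they receive from the class token, the top fraction is preserved, and the remainder is fused into a single token. The function name `reorganize_tokens`, the tensor shapes, and the `keep_rate` parameter are illustrative assumptions, not the authors' interface; the official implementation is in the linked repository.

```python
import torch

def reorganize_tokens(tokens, cls_attn, keep_rate=0.7):
    """
    Illustrative sketch of EViT-style token reorganization.

    tokens:    (B, N, C) image tokens (class token excluded)
    cls_attn:  (B, N) attention from the class token to each image
               token, e.g. averaged over heads in the preceding MHSA
    keep_rate: fraction of image tokens kept as "attentive"
    """
    B, N, C = tokens.shape
    num_keep = max(1, int(N * keep_rate))

    # Rank image tokens by the class-token attention they receive.
    idx = cls_attn.argsort(dim=1, descending=True)            # (B, N)
    keep_idx, fuse_idx = idx[:, :num_keep], idx[:, num_keep:]

    # Preserve the attentive tokens.
    attentive = tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, C))

    # Fuse the inattentive tokens into one token, weighted by their
    # renormalized class-token attention.
    inattentive = tokens.gather(1, fuse_idx.unsqueeze(-1).expand(-1, -1, C))
    w = cls_attn.gather(1, fuse_idx)
    w = (w / w.sum(dim=1, keepdim=True).clamp(min=1e-6)).unsqueeze(-1)
    fused = (inattentive * w).sum(dim=1, keepdim=True)         # (B, 1, C)

    # Reorganized sequence: attentive tokens plus one fused token.
    return torch.cat([attentive, fused], dim=1)
```

In a ViT block, this step would sit between the MHSA and FFN modules, so the FFN and all subsequent blocks operate on the shorter, reorganized token sequence.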
