
Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer

Xu, Yifan ; Zhang, Zhijie ; Zhang, Mengdan ; Sheng, Kekai ; Li, Ke ; Dong, Weiming ; Zhang, Liqing ; Xu, Changsheng ; Sun, Xing
Abstract

Vision transformers (ViTs) have recently received explosive popularity, but their huge computational cost is still a severe issue. Since the computational complexity of ViT is quadratic with respect to the input sequence length, a mainstream paradigm for computation reduction is to reduce the number of tokens. Existing designs include structured spatial compression, which uses a progressive shrinking pyramid to reduce the computations of large feature maps, and unstructured token pruning, which dynamically drops redundant tokens. However, existing token pruning has two limitations: 1) the incomplete spatial structure caused by pruning is not compatible with the structured spatial compression that is commonly used in modern deep-narrow transformers; 2) it usually requires a time-consuming pre-training procedure. To tackle these limitations and expand the applicable scenarios of token pruning, we present Evo-ViT, a self-motivated slow-fast token evolution approach for vision transformers. Specifically, we conduct unstructured instance-wise token selection by taking advantage of the simple and effective global class attention that is native to vision transformers. Then, we propose to update the selected informative tokens and uninformative tokens with different computation paths, namely, slow-fast updating. Since the slow-fast updating mechanism maintains the spatial structure and information flow, Evo-ViT can accelerate vanilla transformers of both flat and deep-narrow structures from the very beginning of the training process. Experimental results demonstrate that our method significantly reduces the computational cost of vision transformers while maintaining comparable performance on image classification.
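The selection and slow-fast updating steps described above can be pictured with a rough NumPy sketch. This is not the authors' implementation: the function name `slow_fast_update`, the `keep_ratio` parameter, the attention-weighted averaging of uninformative tokens into one representative token, and the residual-style fast update are all assumptions made for illustration, since the abstract does not specify these details. The `block` argument is a hypothetical stand-in for a full transformer block.

```python
import numpy as np

def slow_fast_update(tokens, cls_attn, keep_ratio, block):
    """Illustrative slow-fast token update (assumed details, not the paper's code).

    tokens:     (N, D) array of patch tokens.
    cls_attn:   (N,) positive class-token attention scores, one per patch token.
    keep_ratio: fraction of tokens treated as informative (hypothetical knob).
    block:      callable mapping (M, D) -> (M, D), standing in for a
                transformer block.
    """
    n = tokens.shape[0]
    k = max(1, int(round(n * keep_ratio)))

    # Rank tokens by class attention; the top-k are "informative".
    order = np.argsort(-cls_attn)
    keep_idx, rest_idx = order[:k], order[k:]

    out = tokens.copy()
    if rest_idx.size == 0:
        out[keep_idx] = block(tokens[keep_idx])
        return out, keep_idx

    # Assumption: summarize uninformative tokens as one representative token,
    # weighted by their (normalized) class attention.
    w = cls_attn[rest_idx] / cls_attn[rest_idx].sum()
    rep = w @ tokens[rest_idx]

    # Slow path: informative tokens plus the representative go through the block.
    slow_out = block(np.vstack([tokens[keep_idx], rep[None, :]]))
    out[keep_idx] = slow_out[:-1]

    # Fast path (assumption): uninformative tokens receive only the change
    # computed for the representative token, so no token is dropped and the
    # spatial layout of the sequence is preserved.
    out[rest_idx] = tokens[rest_idx] + (slow_out[-1] - rep)
    return out, keep_idx


# Usage with a toy "block" that shifts every token by 1.
rng = np.random.default_rng(0)
toy_tokens = rng.normal(size=(8, 4))
toy_attn = np.array([0.05, 0.30, 0.02, 0.20, 0.03, 0.25, 0.05, 0.10])
updated, kept = slow_fast_update(toy_tokens, toy_attn, 0.5, lambda x: x + 1.0)
print(updated.shape, sorted(kept.tolist()))
```

In this sketch the block only processes `k + 1` tokens instead of `N`, which is where the savings would come from given the quadratic cost of attention in the sequence length, while the output keeps all `N` tokens in their original positions.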
