8 months ago

Image Recognition

Image Classification

Method/Architecture

Computer Vision

Yifan Xu¹,³,⁴*, Zhijie Zhang²,³*, Mengdan Zhang³, Kekai Sheng³, Ke Li³, Weiming Dong¹,⁴†, Liqing Zhang², Changsheng Xu¹,⁴, Xing Sun³†

Abstract

Vision transformers (ViTs) have recently gained explosive popularity, but their huge computational cost remains a severe issue. Since the computational complexity of ViT is quadratic with respect to the input sequence length, a mainstream paradigm for computation reduction is to reduce the number of tokens. Existing designs include structured spatial compression, which uses a progressive shrinking pyramid to reduce the computation on large feature maps, and unstructured token pruning, which dynamically drops redundant tokens. However, existing token pruning has two limitations: 1) the incomplete spatial structure caused by pruning is not compatible with the structured spatial compression commonly used in modern deep-narrow transformers; 2) it usually requires a time-consuming pre-training procedure. To tackle these limitations and expand the applicable scenarios of token pruning, we present Evo-ViT, a self-motivated slow-fast token evolution approach for vision transformers. Specifically, we conduct unstructured instance-wise token selection by taking advantage of the simple and effective global class attention that is native to vision transformers. Then, we propose to update the selected informative tokens and the uninformative tokens with different computation paths, namely, slow-fast updating. Since the slow-fast updating mechanism maintains the spatial structure and information flow, Evo-ViT can accelerate vanilla transformers of both flat and deep-narrow structures from the very beginning of the training process. Experimental results demonstrate that our method significantly reduces the computational cost of vision transformers while maintaining comparable performance on image classification.
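The token selection and slow-fast updating described in the abstract can be illustrated with a minimal sketch in plain Python. The abstract does not specify how the two computation paths are implemented, so several details here are assumptions made only for illustration: the uninformative tokens are summarized into a single attention-weighted "representative" token, that token follows the slow path together with the informative tokens, and the uninformative tokens are then updated by adding the representative token's residual. The names `select_tokens`, `slow_fast_update`, and `toy_block` are hypothetical, and `toy_block` merely stands in for a real transformer block.

```python
from typing import Callable, List, Sequence, Tuple

Token = List[float]
Block = Callable[[List[Token]], List[Token]]


def select_tokens(cls_attn: Sequence[float], keep: int) -> Tuple[List[int], List[int]]:
    """Split token indices into (informative, uninformative) by class attention."""
    order = sorted(range(len(cls_attn)), key=lambda i: cls_attn[i], reverse=True)
    return sorted(order[:keep]), sorted(order[keep:])


def slow_fast_update(
    tokens: List[Token],
    cls_attn: Sequence[float],
    keep: int,
    slow_block: Block,
) -> List[Token]:
    """Update every token, so the output has the same length and order as the input."""
    if not 0 < keep < len(tokens):
        raise ValueError("keep must leave at least one token on each path")
    informative, uninformative = select_tokens(cls_attn, keep)

    # ASSUMPTION: summarize the uninformative tokens as one attention-weighted mean.
    weights = [cls_attn[i] for i in uninformative]
    total = sum(weights)
    if total <= 0.0:  # fall back to a plain mean if all weights are zero
        weights = [1.0] * len(uninformative)
        total = float(len(uninformative))
    dim = len(tokens[0])
    representative = [
        sum(w * tokens[i][d] for w, i in zip(weights, uninformative)) / total
        for d in range(dim)
    ]

    # Slow path: the expensive block sees only keep + 1 tokens.
    slow_out = slow_block([tokens[i] for i in informative] + [representative])

    updated = [list(t) for t in tokens]
    for position, i in enumerate(informative):
        updated[i] = list(slow_out[position])

    # Fast path (ASSUMPTION): reuse the representative token's residual.
    residual = [new - old for new, old in zip(slow_out[-1], representative)]
    for i in uninformative:
        updated[i] = [x + r for x, r in zip(tokens[i], residual)]
    return updated


def toy_block(sequence: List[Token]) -> List[Token]:
    """Hypothetical stand-in for a transformer block: doubles every feature."""
    return [[2.0 * x for x in token] for token in sequence]


tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
cls_attn = [0.5, 0.1, 0.3, 0.1]  # made-up class-attention scores, one per token
updated = slow_fast_update(tokens, cls_attn, keep=2, slow_block=toy_block)
```

In this sketch no token is dropped: `updated` has one entry per input token, in the original order, which is the property that keeps the result compatible with later structured spatial compression. The expensive block only processes `keep + 1` tokens, which is where the savings would come from under a quadratic attention cost.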

