DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh
Abstract

Attention is sparse in vision transformers. We observe that the final prediction in vision transformers is based on only a subset of the most informative tokens, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework to prune redundant tokens progressively and dynamically based on the input. Specifically, we devise a lightweight prediction module to estimate the importance score of each token given the current features. The module is added at different layers to prune redundant tokens hierarchically. To optimize the prediction module in an end-to-end manner, we propose an attention masking strategy that differentiably prunes a token by blocking its interactions with other tokens. Benefiting from the nature of self-attention, the unstructured sparse tokens remain hardware friendly, which makes it easy for our framework to achieve actual speed-ups. By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%–37% and improves throughput by over 40%, while the drop in accuracy stays within 0.5% for various vision transformers. Equipped with the dynamic token sparsification framework, DynamicViT models achieve highly competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet. Code is available at https://github.com/raoyongming/DynamicViT
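The core mechanism described in the abstract — score each token's importance, then "prune" low-scoring tokens by masking them out of the attention computation so they no longer influence other tokens — can be illustrated with a small NumPy sketch. This is a toy illustration, not the paper's implementation: the random linear score head stands in for the learned lightweight prediction module, and the Gumbel-softmax sampling used for end-to-end training is omitted; the `-1e9` additive bias on masked attention logits is the standard masking trick assumed here.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_token_mask(q, k, v, keep_mask):
    """Self-attention in which tokens with keep_mask == 0 are 'pruned':
    a large negative bias on their attention logits blocks all interactions
    with them, so they stop contributing to the output."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                          # (n, n) attention logits
    logits = logits + np.where(keep_mask[None, :] > 0.5, 0.0, -1e9)
    attn = softmax(logits, axis=-1)                        # pruned columns get ~0 weight
    return attn @ v, attn

n, d = 8, 16
x = rng.standard_normal((n, d))

# Toy importance predictor: a random linear score head (a stand-in for the
# paper's learned prediction module).
w = rng.standard_normal(d)
scores = x @ w

# Keep roughly 34% of tokens, mirroring the paper's 66% pruning ratio.
keep = int(np.ceil(n * (1 - 0.66)))
keep_mask = np.zeros(n)
keep_mask[np.argsort(scores)[-keep:]] = 1.0

out, attn = attention_with_token_mask(x, x, x, keep_mask)
# Attention mass on pruned tokens is effectively zero.
print(attn[:, keep_mask == 0].max())
```

In the actual framework this masking is what makes pruning differentiable during training; at inference the masked tokens can simply be dropped, which is why the speed-up is realized on real hardware.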