
PPT: Token Pruning and Pooling for Efficient Vision Transformers

Wu, Xinjian; Zeng, Fanhu; Wang, Xiudong; Chen, Xinghao
Abstract

Vision Transformers (ViTs) have emerged as powerful models in the field of computer vision, delivering superior performance across various vision tasks. However, their high computational complexity poses a significant barrier to practical applications in real-world scenarios. Motivated by the fact that not all tokens contribute equally to the final predictions and fewer tokens bring less computational cost, reducing redundant tokens has become a prevailing paradigm for accelerating vision transformers. However, we argue that it is not optimal to either only reduce inattentive redundancy by token pruning, or only reduce duplicative redundancy by token merging. To this end, in this paper we propose a novel acceleration framework, namely token Pruning & Pooling Transformers (PPT), to adaptively tackle these two types of redundancy in different layers. By heuristically integrating both token pruning and token pooling techniques in ViTs without additional trainable parameters, PPT effectively reduces the model complexity while maintaining its predictive accuracy. For example, PPT reduces over 37% FLOPs and improves the throughput by over 45% for DeiT-S without any accuracy drop on the ImageNet dataset. The code is available at https://github.com/xjwu1024/PPT and https://github.com/mindspore-lab/models/
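The abstract only outlines the framework at a high level. Below is a minimal, hypothetical PyTorch sketch of how attention-based token pruning and similarity-based token pooling could be combined in one step; the function name `prune_and_pool`, the keep/pool ratios, and the pooling-by-nearest-center scheme are illustrative assumptions, not the paper's actual algorithm (see the linked repositories for the official code).

```python
# Illustrative sketch only (not the authors' implementation): prune the least
# attended tokens, then pool (merge) similar survivors. Assumes patch tokens of
# shape (B, N, C) and per-token attention scores from the [CLS] token.
import torch
import torch.nn.functional as F


def prune_and_pool(tokens, cls_attn, keep_ratio=0.7, pool_ratio=0.5):
    """tokens: (B, N, C) patch tokens; cls_attn: (B, N) [CLS]-to-patch attention."""
    B, N, C = tokens.shape

    # --- Token pruning: keep the top-k tokens by CLS attention (inattentive redundancy) ---
    n_keep = max(1, int(N * keep_ratio))
    idx = cls_attn.topk(n_keep, dim=1).indices                      # (B, n_keep), most attended first
    kept = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, C))

    # --- Token pooling: merge each kept token into its most similar "center" (duplicative redundancy) ---
    n_out = max(1, int(n_keep * pool_ratio))
    normed = F.normalize(kept, dim=-1)
    sim = normed @ normed.transpose(1, 2)                           # (B, n_keep, n_keep) cosine similarity
    # Use the n_out most attended kept tokens as pooling centers.
    assign = sim[:, :, :n_out].argmax(dim=-1)                       # (B, n_keep) center index per token
    pooled = torch.zeros(B, n_out, C, device=tokens.device, dtype=tokens.dtype)
    counts = torch.zeros(B, n_out, 1, device=tokens.device, dtype=tokens.dtype)
    pooled.scatter_add_(1, assign.unsqueeze(-1).expand(-1, -1, C), kept)
    counts.scatter_add_(1, assign.unsqueeze(-1), torch.ones(B, n_keep, 1, device=tokens.device, dtype=tokens.dtype))
    return pooled / counts.clamp(min=1)                             # average-merge tokens per center


# Example: 196 patch tokens reduced to roughly 68 output tokens per image.
x = torch.randn(2, 196, 384)
attn = torch.rand(2, 196)
print(prune_and_pool(x, attn).shape)  # torch.Size([2, 68, 384])
```

Note that this sketch applies a fixed keep/pool split, whereas the paper's framework adaptively decides which type of redundancy to target at different layers.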
