
SPViT: Enabling Faster Vision Transformers via Soft Token Pruning

Kong, Zhenglun; Dong, Peiyan; Ma, Xiaolong; Meng, Xin; Sun, Mengshu; Niu, Wei; Shen, Xuan; Yuan, Geng; Ren, Bin; Qin, Minghai; Tang, Hao; Wang, Yanzhi
Abstract

Recently, the Vision Transformer (ViT) has continuously set new milestones in computer vision, but its high computation and memory cost hinders adoption in industrial production. Pruning, a traditional model-compression paradigm for hardware efficiency, has been widely applied to various DNN structures; nevertheless, it remains unclear how to perform pruning tailored to the ViT structure. Considering three key points: the structural characteristics of ViTs, their internal data patterns, and the related edge-device deployment, we leverage input token sparsity and propose a computation-aware soft pruning framework that can be set up on vanilla Transformers of both flat and CNN-type structures, such as the Pooling-based ViT (PiT). More concretely, we design a dynamic attention-based multi-head token selector, a lightweight module for adaptive instance-wise token selection. We further introduce a soft pruning technique that integrates the less informative tokens identified by the selector into a single package token, which participates in subsequent calculations rather than being completely discarded. Our framework balances accuracy against the computation constraints of specific edge devices through our proposed computation-aware training strategy. Experimental results show that our framework significantly reduces the computation cost of ViTs while maintaining comparable performance on image classification. Moreover, our framework guarantees that the identified model meets the resource specifications of mobile devices and FPGAs, and even achieves real-time execution of DeiT-T on mobile platforms. For example, our method reduces the latency of DeiT-T to 26 ms (26%$\sim$41% better than existing works) on a mobile device, with 0.25%$\sim$4% higher top-1 accuracy on ImageNet.
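To make the soft pruning idea concrete, here is a minimal NumPy sketch of the core step the abstract describes: instead of discarding the tokens a selector scores as less informative, they are aggregated into a single "package" token that stays in the sequence for subsequent layers. The function name, the score-weighted averaging, and the `keep_ratio` parameter are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def soft_prune_tokens(tokens, keep_scores, keep_ratio=0.5):
    """Soft token pruning sketch (assumed interface, not the paper's code).

    tokens:      (N, D) array of token embeddings (CLS token excluded)
    keep_scores: (N,) informativeness scores from a token selector module
    keep_ratio:  fraction of tokens kept individually
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    order = np.argsort(keep_scores)[::-1]            # most informative first
    kept_idx, pruned_idx = order[:n_keep], order[n_keep:]

    kept = tokens[kept_idx]
    if len(pruned_idx) == 0:
        return kept

    # Package token: score-weighted average of the pruned tokens, so their
    # information is retained in compressed form rather than dropped.
    w = keep_scores[pruned_idx]
    w = w / (w.sum() + 1e-6)
    package = (w[:, None] * tokens[pruned_idx]).sum(axis=0, keepdims=True)

    # Kept tokens plus one package token go on to the next attention block.
    return np.concatenate([kept, package], axis=0)
```

With `keep_ratio=0.5` on 8 tokens, the output has 5 rows: the 4 highest-scoring tokens plus one package token, shrinking the quadratic attention cost of later layers while preserving a summary of the pruned content.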
