Pruning Self-attentions into Convolutional Layers in Single Path

He, Haoyu ; Cai, Jianfei ; Liu, Jing ; Pan, Zizheng ; Zhang, Jing ; Tao, Dacheng ; Zhuang, Bohan

Abstract

Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks. However, modeling global correlations with multi-head self-attention (MSA) layers leads to two widely recognized issues: the massive computational resource consumption and the lack of intrinsic inductive bias for modeling local visual patterns. To solve both issues, we devise a simple yet effective method named Single-Path Vision Transformer pruning (SPViT), to efficiently and automatically compress the pre-trained ViTs into compact models with proper locality added. Specifically, we first propose a novel weight-sharing scheme between MSA and convolutional operations, delivering a single-path space to encode all candidate operations. In this way, we cast the operation search problem as finding which subset of parameters to use in each MSA layer, which significantly reduces the computational cost and optimization difficulty, and the convolution kernels can be well initialized using pre-trained MSA parameters. Relying on the single-path space, we introduce learnable binary gates to encode the operation choices in MSA layers. Similarly, we further employ learnable gates to encode the fine-grained MLP expansion ratios of FFN layers. In this way, our SPViT optimizes the learnable gates to automatically explore from a vast and unified search space and flexibly adjust the MSA-FFN pruning proportions for each individual dense model. We conduct extensive experiments on two representative ViTs showing that our SPViT achieves a new SOTA for pruning on ImageNet-1k. For example, our SPViT can trim 52.0% FLOPs for DeiT-B and get an impressive 0.6% top-1 accuracy gain simultaneously. The source code is available at https://github.com/ziplab/SPViT.
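
To illustrate how binary gates can encode an FFN expansion ratio, the following is a minimal sketch and not the SPViT implementation. It assumes a toy bias-free two-layer MLP with ReLU in pure Python, and the gate values are fixed by hand rather than learned. All names (`ffn`, `prune`, `gates`) are hypothetical. The sketch only shows that masking hidden units with 0/1 gates gives the same output as physically removing those units, which is why a gated dense layer can later be exported as a smaller one.

```python
import random

def relu(v):
    return v if v > 0.0 else 0.0

def ffn(x, w1, w2, gates=None):
    """Toy FFN: y = W2 * relu(W1 * x), with optional 0/1 gates on hidden units."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    if gates is not None:
        # A gate of 0 silences a hidden unit; a gate of 1 keeps it unchanged.
        hidden = [g * h for g, h in zip(gates, hidden)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def prune(w1, w2, gates):
    """Physically drop the hidden units whose gate is 0."""
    keep = [i for i, g in enumerate(gates) if g == 1]
    w1_pruned = [w1[i] for i in keep]                    # rows of W1
    w2_pruned = [[row[i] for i in keep] for row in w2]   # columns of W2
    return w1_pruned, w2_pruned

random.seed(0)
dim, hidden_dim = 4, 16  # toy sizes, expansion ratio 16 / 4 = 4
w1 = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(hidden_dim)]
w2 = [[random.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(dim)]
x = [random.uniform(-1, 1) for _ in range(dim)]

# Assumption: hand-picked gates. A gate-based pruning method would learn them.
gates = [1] * 8 + [0] * 8  # keep half of the hidden units

y_masked = ffn(x, w1, w2, gates)       # dense weights, gated hidden units
w1_p, w2_p = prune(w1, w2, gates)
y_pruned = ffn(x, w1_p, w2_p)          # smaller weights, no gates

effective_ratio = len(w1_p) / dim      # expansion ratio after pruning
max_diff = max(abs(a - b) for a, b in zip(y_masked, y_pruned))
```

Here `effective_ratio` drops from 4 to 2 while `max_diff` stays at floating-point noise level, so the gated and the pruned networks compute the same function.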