
Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention

Liu, Xiangcheng; Wu, Tianyi; Guo, Guodong
Abstract

Vision transformers have emerged as a new paradigm in computer vision, delivering excellent performance at a high computational cost. Image token pruning is one of the main approaches to ViT compression, because the complexity is quadratic in the number of tokens and many tokens containing only background regions do not truly contribute to the final prediction. Existing works either rely on additional modules to score the importance of individual tokens, or apply a fixed pruning ratio to all input instances. In this work, we propose an adaptive sparse token pruning framework with minimal cost. Specifically, we first propose an inexpensive attention-head-importance-weighted class attention scoring mechanism. Then, learnable parameters are inserted as thresholds to distinguish informative tokens from unimportant ones. By comparing token attention scores against these thresholds, we can discard useless tokens hierarchically and thus accelerate inference. The learnable thresholds are optimized in budget-aware training to balance accuracy and complexity, yielding instance-specific pruning configurations for different inputs. Extensive experiments demonstrate the effectiveness of our approach. Our method improves the throughput of DeiT-S by 50% with only a 0.2% drop in top-1 accuracy, achieving a better trade-off between accuracy and latency than previous methods.
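The abstract describes two ingredients: scoring each token by the class token's attention to it, weighted by per-head importance, and comparing those scores against a learnable threshold so that the kept token set varies per input. The sketch below is a minimal, hypothetical PyTorch illustration of that idea, not the authors' implementation; the module name `AdaptiveTokenPruner`, the learnable `head_weight` parameter, and the hard threshold comparison are assumptions made for clarity (in training the paper's thresholds would need a differentiable relaxation and a budget-aware loss).

```python
# Minimal sketch (not the authors' code) of head-importance-weighted class
# attention scoring with a learnable pruning threshold. Assumes a standard
# ViT attention layout: attn has shape (B, H, N, N) and index 0 is the class token.
import torch
import torch.nn as nn


class AdaptiveTokenPruner(nn.Module):
    def __init__(self, num_heads: int, init_threshold: float = 0.01):
        super().__init__()
        # Learnable threshold separating informative tokens from background ones;
        # budget-aware training would tune it to trade accuracy for complexity.
        self.threshold = nn.Parameter(torch.tensor(init_threshold))
        # Hypothetical per-head importance weights; the paper derives head
        # importance from the attention itself, here kept as a free parameter.
        self.head_weight = nn.Parameter(torch.ones(num_heads) / num_heads)

    def forward(self, tokens: torch.Tensor, attn: torch.Tensor):
        # attn[:, :, 0, 1:] is the class token's attention to each patch token.
        cls_attn = attn[:, :, 0, 1:]                       # (B, H, N-1)
        w = torch.softmax(self.head_weight, dim=0)         # normalized head importance
        scores = torch.einsum("h,bhn->bn", w, cls_attn)    # (B, N-1) weighted token scores
        keep = scores >= self.threshold                    # per-instance boolean mask
        # Keep the class token plus the selected patch tokens. Shown for batch
        # size 1 for simplicity; a batched version would gather or pad per sample.
        cls_tok, patch_tok = tokens[:, :1], tokens[:, 1:]
        pruned = torch.cat([cls_tok, patch_tok[keep].unsqueeze(0)], dim=1)
        return pruned, keep


# Usage with dummy DeiT-S-like shapes: 6 heads, 197 tokens (1 class + 196 patches), dim 384.
pruner = AdaptiveTokenPruner(num_heads=6)
tokens = torch.randn(1, 197, 384)
attn = torch.softmax(torch.randn(1, 6, 197, 197), dim=-1)
pruned_tokens, kept_mask = pruner(tokens, attn)
```

Because the mask is computed from attention already produced by the backbone, the scoring adds almost no overhead, and applying such a pruner hierarchically at several layers is what yields the adaptive, input-dependent token budget described above.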
