IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision Transformers

Pan, Bowen; Panda, Rameswar; Jiang, Yifan; Wang, Zhangyang; Feris, Rogerio; Oliva, Aude
Abstract

The self-attention-based model, the transformer, has recently become the leading backbone in the field of computer vision. Despite the impressive success of transformers across a variety of vision tasks, they still suffer from heavy computation and intensive memory costs. To address this limitation, this paper presents an Interpretability-Aware REDundancy REDuction framework (IA-RED$^2$). We start by observing a large amount of redundant computation, mainly spent on uncorrelated input patches, and then introduce an interpretable module to dynamically and gracefully drop these redundant patches. This novel framework is then extended to a hierarchical structure, where uncorrelated tokens at different stages are gradually removed, resulting in a considerable shrinkage of computational cost. We include extensive experiments on both image and video tasks, where our method delivers up to a 1.4x speed-up for state-of-the-art models such as DeiT and TimeSformer while sacrificing less than 0.7% accuracy. More importantly, contrary to other acceleration approaches, our method is inherently interpretable, with substantial visual evidence, making the vision transformer closer to a more human-understandable architecture while being lighter. We demonstrate, with both qualitative and quantitative results, that the interpretability that naturally emerges in our framework can outperform the raw attention learned by the original vision transformer, as well as that generated by off-the-shelf interpretation methods. Project page: http://people.csail.mit.edu/bpan/ia-red/.
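To make the core idea concrete, below is a minimal PyTorch sketch of interpretability-aware token dropping: a small policy head scores each patch token, the lowest-scored patches are discarded, and the procedure is applied hierarchically with shrinking keep ratios across stages. This is an illustrative approximation under stated assumptions, not the authors' actual IA-RED$^2$ implementation (the paper trains its interpretable modules rather than using a fixed top-k rule); the name `PatchScorer`, the keep ratios, and the DeiT-S-like shapes are all hypothetical.

```python
import torch
import torch.nn as nn

class PatchScorer(nn.Module):
    """Hypothetical policy head that scores the informativeness of each
    patch token and keeps only the top-scoring ones. Illustrative only;
    the paper's module is learned, not a hand-set top-k filter."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, tokens: torch.Tensor, keep_ratio: float):
        # tokens: (B, 1 + N, D), with a leading [CLS] token that is always kept
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]
        scores = self.score(patches).squeeze(-1)            # (B, N)
        n_keep = max(1, int(patches.size(1) * keep_ratio))
        keep_idx = scores.topk(n_keep, dim=1).indices       # most informative patches
        keep_idx = keep_idx.sort(dim=1).values              # restore spatial order
        kept = patches.gather(
            1, keep_idx.unsqueeze(-1).expand(-1, -1, patches.size(-1))
        )
        return torch.cat([cls_tok, kept], dim=1), scores


if __name__ == "__main__":
    # DeiT-S-like token sequence (assumed shapes): 196 patches of dim 384.
    B, N, D = 2, 196, 384
    x = torch.randn(B, 1 + N, D)
    scorer = PatchScorer(D)
    # Hierarchical redundancy reduction: progressively tighter keep ratios,
    # as if applied between successive groups of transformer blocks.
    for ratio in (0.9, 0.8, 0.7):
        x, scores = scorer(x, keep_ratio=ratio)
    print(x.shape)  # far fewer tokens than the original 1 + 196
```

The per-patch scores produced at each stage double as the interpretability signal: patches that survive to the final stage are, by construction, the ones the model judged most relevant to the prediction.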