Fast Vision Transformers with HiLo Attention
Pan Zizheng; Cai Jianfei; Zhuang Bohan
Abstract
Vision Transformers (ViTs) have triggered the most recent and significant breakthroughs in computer vision. Their efficient designs are mostly guided by the indirect metric of computational complexity, i.e., FLOPs, which however has a clear gap with the direct metric such as throughput. Thus, we propose to use the direct speed evaluation on the target platform as the design principle for efficient ViTs. Particularly, we introduce LITv2, a simple and effective ViT which performs favourably against the existing state-of-the-art methods across a spectrum of different model sizes with faster speed. At the core of LITv2 is a novel self-attention mechanism, which we dub HiLo. HiLo is inspired by the insight that high frequencies in an image capture local fine details and low frequencies focus on global structures, whereas a multi-head self-attention layer neglects the characteristic of different frequencies. Therefore, we propose to disentangle the high/low frequency patterns in an attention layer by separating the heads into two groups, where one group encodes high frequencies via self-attention within each local window, and another group encodes low frequencies by performing global attention between the average-pooled low-frequency keys and values from each window and each query position in the input feature map. Benefiting from the efficient design for both groups, we show that HiLo is superior to the existing attention mechanisms by comprehensively benchmarking FLOPs, speed and memory consumption on GPUs and CPUs. For example, HiLo is 1.4x faster than spatial reduction attention and 1.6x faster than local window attention on CPUs. Powered by HiLo, LITv2 serves as a strong backbone for mainstream vision tasks including image classification, dense detection and segmentation. Code is available at https://github.com/ziplab/LITv2.
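To make the two head groups concrete, the following is a minimal sketch that counts how many query-key score entries each group would compute, compared with full global attention over the same feature map. It is not taken from the LITv2 code base: the function name `attention_pairs`, the parameter `alpha` (the fraction of heads given to the low-frequency group) and the assumption of square, non-overlapping windows are illustrative choices. The count covers only attention score entries, not projections or pooling, so it says nothing about measured throughput, which the abstract argues should be evaluated directly on the target platform.

```python
def attention_pairs(height, width, num_heads, window, alpha):
    """Count query-key score entries for a HiLo-style head split.

    All names here are hypothetical. `alpha` is the assumed fraction of
    heads assigned to the low-frequency (pooled, global) group; the rest
    go to the high-frequency (local window) group.
    """
    if height % window or width % window:
        raise ValueError("feature map must be divisible by the window size")

    num_queries = height * width
    num_windows = (height // window) * (width // window)

    lo_heads = int(round(num_heads * alpha))
    hi_heads = num_heads - lo_heads

    # High-frequency group: each query attends only to the tokens
    # inside its own window (window * window keys).
    hi_pairs = hi_heads * num_queries * window * window

    # Low-frequency group: each query attends to one average-pooled
    # key per window, so there are num_windows keys in total.
    lo_pairs = lo_heads * num_queries * num_windows

    # Baseline: every head attends from every query to every position.
    full_pairs = num_heads * num_queries * num_queries

    return {
        "hi": hi_pairs,
        "lo": lo_pairs,
        "hilo": hi_pairs + lo_pairs,
        "full": full_pairs,
    }


# Example: a small 4x4 feature map, two heads split evenly, 2x2 windows.
print(attention_pairs(height=4, width=4, num_heads=2, window=2, alpha=0.5))
```

Under these assumptions, the 4x4 example yields 128 score entries for the split heads against 512 for full attention. The gap widens as the feature map grows, because the full-attention count is quadratic in the number of positions while the windowed and pooled counts are not.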