
Scalable Vision Transformers with Hierarchical Pooling

Zizheng Pan, Bohan Zhuang†, Jing Liu, Haoyu He, Jianfei Cai

Abstract

The recently proposed Vision Transformers (ViT) with pure attention have achieved promising performance on image recognition tasks, such as image classification. However, the routine of the current ViT model is to maintain a full-length patch sequence during inference, which is redundant and lacks hierarchical representation. To this end, we propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length and hence reduce the computational cost, analogous to the downsampling of feature maps in Convolutional Neural Networks (CNNs). This brings the benefit that we can increase the model capacity by scaling the depth, width, resolution, and patch size without introducing extra computational complexity, thanks to the reduced sequence length. Moreover, we empirically find that the average-pooled visual tokens contain more discriminative information than the single class token. To demonstrate the improved scalability of our HVT, we conduct extensive experiments on the image classification task. With comparable FLOPs, our HVT outperforms the competitive baselines on the ImageNet and CIFAR-100 datasets. Code is available at https://github.com/MonashAI/HVT
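The mechanism the abstract describes, pooling the token sequence between stages of transformer blocks and average-pooling the surviving tokens for classification instead of reading a class token, can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's implementation: the stage split, the stride-2 max pooling, and all hyperparameters below are assumptions for the sketch; the reference implementation is at the linked repository.

```python
import torch
import torch.nn as nn

class HVTSketch(nn.Module):
    """Minimal sketch of HVT-style hierarchical token pooling.

    Between stages of transformer blocks, a stride-2 1D max pooling
    halves the token sequence, analogous to feature-map downsampling
    in CNNs. The classifier averages all remaining visual tokens
    rather than using a single class token.
    """

    def __init__(self, dim=192, heads=3, blocks_per_stage=3,
                 num_stages=4, num_classes=1000):
        super().__init__()

        def make_stage():
            layer = nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=blocks_per_stage)

        self.stages = nn.ModuleList(make_stage() for _ in range(num_stages))
        # Roughly halves the sequence length after each stage.
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):  # tokens: (batch, seq_len, dim)
        for i, stage in enumerate(self.stages):
            tokens = stage(tokens)
            if i < len(self.stages) - 1:  # no pooling after the last stage
                # MaxPool1d operates on the last dim, so pool along seq_len.
                tokens = self.pool(tokens.transpose(1, 2)).transpose(1, 2)
        # Average-pool the remaining visual tokens for classification.
        return self.head(tokens.mean(dim=1))


# Usage: 196 patch tokens (a 14x14 grid) are pooled 196 -> 98 -> 49 -> 25.
model = HVTSketch()
logits = model(torch.randn(2, 196, 192))
print(logits.shape)  # torch.Size([2, 1000])
```

Because self-attention cost grows quadratically with sequence length, halving the tokens at each stage is what frees the FLOPs budget for scaling depth, width, resolution, or patch size, as the abstract claims.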

