All Tokens Matter: Token Labeling for Training Better Vision Transformers

In this paper, we present token labeling -- a new training objective for training high-performance vision transformers (ViTs). Different from the standard training objective of ViTs, which computes the classification loss on an additional trainable class token, our proposed objective takes advantage of all the image patch tokens to compute the training loss in a dense manner. Specifically, token labeling reformulates the image classification problem into multiple token-level recognition problems and assigns each patch token an individual location-specific supervision generated by a machine annotator. Experiments show that token labeling clearly and consistently improves the performance of various ViT models across a wide spectrum. Taking a vision transformer with 26M learnable parameters as an example, with token labeling the model can achieve 84.4% Top-1 accuracy on ImageNet. The result can be further increased to 86.4% by slightly scaling the model size up to 150M, making it the smallest model among previous models (250M+) to reach 86%. We also show that token labeling clearly improves the generalization capability of the pre-trained models on downstream tasks with dense prediction, such as semantic segmentation. Our code and all the training details will be made publicly available at https://github.com/zihangJiang/TokenLabeling.
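
To make the objective concrete, the sketch below illustrates how a class-token loss can be combined with a dense, per-patch-token loss in PyTorch. The function name, tensor shapes, the auxiliary-loss weight `beta`, and the use of soft per-token targets produced offline by a machine annotator are illustrative assumptions for this sketch, not the exact implementation from the repository.

```python
# Minimal sketch of a token labeling objective (assumed form, not the
# repository's exact implementation).
import torch
import torch.nn.functional as F


def token_labeling_loss(cls_logits, patch_logits, image_label, token_labels, beta=0.5):
    """Combine the usual class-token loss with a dense patch-token loss.

    cls_logits:   (B, C)     logits from the trainable class token
    patch_logits: (B, N, C)  logits predicted for each of the N patch tokens
    image_label:  (B,)       ground-truth image-level class indices
    token_labels: (B, N, C)  location-specific soft labels from a machine annotator
    """
    # Standard image-level classification loss on the class token.
    cls_loss = F.cross_entropy(cls_logits, image_label)

    # Dense token-level loss: soft cross-entropy between each patch token's
    # prediction and its location-specific target, averaged over all tokens.
    log_probs = F.log_softmax(patch_logits, dim=-1)            # (B, N, C)
    token_loss = -(token_labels * log_probs).sum(dim=-1).mean()

    # The dense loss acts as an auxiliary term alongside the usual objective.
    return cls_loss + beta * token_loss


# Example usage with random tensors (B images, N patch tokens, C classes).
B, N, C = 8, 196, 1000
loss = token_labeling_loss(
    torch.randn(B, C),
    torch.randn(B, N, C),
    torch.randint(0, C, (B,)),
    torch.softmax(torch.randn(B, N, C), dim=-1),
)
```

The key design point reflected here is that every patch token receives its own supervision signal, rather than the class token alone carrying the training loss; the weighting between the two terms is a tunable choice in this sketch.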