
Training data-efficient image transformers & distillation through attention

Touvron, Hugo; Cord, Matthieu; Douze, Matthijs; Massa, Francisco; Sablayrolles, Alexandre; Jégou, Hervé
Abstract

Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption.

In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data.

More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
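To make the distillation-token idea concrete, here is a minimal PyTorch sketch of the mechanism the abstract describes: a learnable distillation token is appended next to the usual class token, attends to the patch tokens through the transformer, and its output is supervised by a teacher's predictions. All dimensions, layer counts, and helper names (`DistilledViT`, `hard_distillation_loss`) are illustrative assumptions, not the paper's actual configuration or released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DistilledViT(nn.Module):
    """Sketch of a vision transformer with an extra distillation token.
    Sizes here are toy values, not the 86M-parameter reference model."""

    def __init__(self, num_classes=1000, dim=192, depth=4, heads=3,
                 num_patches=196, patch_dim=768):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, dim)   # flattened 16x16x3 patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, dim))   # distillation token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)        # supervised by true labels
        self.head_dist = nn.Linear(dim, num_classes)   # supervised by the teacher

    def forward(self, patches):
        # patches: (B, num_patches, patch_dim) flattened image patches
        x = self.patch_embed(patches)
        b = x.size(0)
        tokens = torch.cat([self.cls_token.expand(b, -1, -1),
                            self.dist_token.expand(b, -1, -1), x], dim=1)
        x = self.encoder(tokens + self.pos_embed)
        # Two predictions: one from the class token, one from the distillation token.
        return self.head(x[:, 0]), self.head_dist(x[:, 1])


def hard_distillation_loss(logits_cls, logits_dist, teacher_logits, labels):
    """Hard-label variant: the class head matches the ground-truth label,
    the distillation head matches the teacher's argmax prediction."""
    teacher_labels = teacher_logits.argmax(dim=-1)
    return 0.5 * F.cross_entropy(logits_cls, labels) + \
           0.5 * F.cross_entropy(logits_dist, teacher_labels)
```

At inference time the two heads can simply be averaged; during training the teacher (for example a convnet, as the abstract suggests) only provides `teacher_logits` and receives no gradients.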
