MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer

The recently proposed data augmentation TransMix employs attention labels to help vision transformers (ViT) achieve better robustness and performance. However, TransMix is deficient in two aspects: 1) the image cropping method of TransMix may not be suitable for ViTs; 2) at the early stage of training, the model produces unreliable attention maps, and TransMix uses these unreliable maps to compute mixed attention labels, which can mislead the model. To address these issues, we propose MaskMix and Progressive Attention Labeling (PAL) in image space and label space, respectively. From the perspective of image space, we design MaskMix, which mixes two images based on a patch-like grid mask. In particular, the size of each mask patch is adjustable and is a multiple of the image patch size, which ensures that each image patch comes from only one image and contains more global content (a sketch is given below). From the perspective of label space, we design PAL, which utilizes a progressive factor to dynamically re-weight the attention weights of the mixed attention label (see the second sketch below). Finally, we combine MaskMix and Progressive Attention Labeling into our new data augmentation method, named MixPro. Experimental results show that our method improves various ViT-based models at different scales on ImageNet classification (73.8% top-1 accuracy based on DeiT-T trained for 300 epochs). After being pre-trained with MixPro on ImageNet, the ViT-based models also demonstrate better transferability to semantic segmentation, object detection, and instance segmentation. Furthermore, compared to TransMix, MixPro shows stronger robustness on several benchmarks. The code is available at https://github.com/fistyee/MixPro.
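
The following is a minimal sketch of a MaskMix-style patch-aligned grid mix, assuming PyTorch, inputs whose spatial size is divisible by the mask patch size, and a 16-pixel ViT image patch. The function name, arguments, and the cell-sampling scheme are illustrative assumptions, not the authors' implementation (see the repository above for that).

```python
# Hedged sketch of MaskMix: mix two image batches with a binary grid mask
# whose cells are a multiple of the ViT patch size, so every image patch
# comes entirely from one of the two source images.
import torch

def maskmix(x_a: torch.Tensor, x_b: torch.Tensor,
            patch_size: int = 16, multiple: int = 2):
    _, _, h, w = x_a.shape
    m = multiple * patch_size            # mask patch size in pixels
    gh, gw = h // m, w // m              # grid of mask cells (assumes h, w divisible by m)
    n_cells = gh * gw

    lam = torch.rand(1).item()           # target mix ratio
    n_a = int(round(lam * n_cells))      # cells taken from x_a

    # Randomly assign n_a cells to image A, the rest to image B.
    perm = torch.randperm(n_cells, device=x_a.device)
    cell_mask = torch.zeros(n_cells, device=x_a.device)
    cell_mask[perm[:n_a]] = 1.0
    cell_mask = cell_mask.view(1, 1, gh, gw)

    # Upsample the cell mask to pixel resolution; each cell stays intact.
    mask = cell_mask.repeat_interleave(m, dim=2).repeat_interleave(m, dim=3)
    x = mask * x_a + (1.0 - mask) * x_b
    lam = cell_mask.mean().item()        # exact area ratio after rounding
    return x, lam
```

Because the mask is sampled over whole cells rather than a rectangular crop, each cell carries content from a single image and spans more than one ViT patch, which is the property the abstract attributes to MaskMix.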
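The second sketch illustrates the idea behind Progressive Attention Labeling: early attention maps are unreliable, so the attention-derived mix ratio is trusted more as training progresses. Here `lam_attn` would be the share of the model's attention mass falling on patches drawn from the first image (as in TransMix); the linear schedule for the progressive factor is an assumption for illustration, not necessarily the schedule used in the paper.

```python
# Hedged sketch of PAL: blend the area-based mix ratio with the
# attention-based ratio using a progressive factor alpha in [0, 1]
# that grows with training progress.
def pal_lambda(lam_area: float, lam_attn: float,
               epoch: int, total_epochs: int) -> float:
    alpha = epoch / total_epochs                     # progressive factor (illustrative linear schedule)
    return (1.0 - alpha) * lam_area + alpha * lam_attn

# The mixed training target then interpolates the two labels:
#   y = lam * y_a + (1.0 - lam) * y_b,  with lam = pal_lambda(...)
# so early epochs rely mostly on the area ratio, later epochs on attention.
```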