SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization

Fine-grained visual categorization (FGVC) aims at recognizing objects from similar subordinate categories, which is challenging and practical for humans' accurate automatic recognition needs. Most FGVC approaches focus on attention mechanisms for mining discriminative regions while neglecting their interdependencies and the holistic object structure they compose, both of which are essential for a model's ability to localize and understand discriminative information. To address these limitations, we propose the Structure Information Modeling Transformer (SIM-Trans), which incorporates object structure information into the transformer so that the learned discriminative representations capture both appearance and structure. Specifically, we encode the image into a sequence of patch tokens and build a strong vision transformer framework with two well-designed modules: (i) the structure information learning (SIL) module mines the spatial context relations of significant patches within the object extent with the help of the transformer's self-attention weights, and injects these relations into the model to import structure information; (ii) the multi-level feature boosting (MFB) module exploits the complementarity of multi-level features and contrastive learning among classes to enhance feature robustness for accurate recognition. The two proposed modules are lightweight, depend only on the attention weights that come with the vision transformer itself, and can be plugged into any transformer network and trained end-to-end easily. Extensive experiments and analyses demonstrate that SIM-Trans achieves state-of-the-art performance on fine-grained visual categorization benchmarks. The code is available at https://github.com/PKU-ICST-MIPL/SIM-Trans_ACMMM2022.
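The abstract's SIL idea rests on a simple primitive: using the [CLS]-to-patch self-attention weights of a vision transformer to pick out significant patches and their grid positions for structure modeling. The following is a minimal sketch of that primitive only, not the paper's actual implementation; the head-averaging, the `grid_size` layout, and the function name are illustrative assumptions.

```python
import numpy as np

def select_significant_patches(attn, grid_size, k=4):
    """Pick the k patches most attended by the [CLS] token.

    attn: array of shape (heads, N+1, N+1) holding one layer's
          self-attention weights, where token 0 is [CLS] and the
          remaining N = grid_size**2 tokens are image patches
          (an assumed layout; SIM-Trans may aggregate differently).
    Returns the flat indices and (row, col) grid coordinates of the
    top-k patches; the coordinates are the kind of spatial context
    a structure-learning module could consume.
    """
    cls_to_patch = attn.mean(axis=0)[0, 1:]        # average heads, take CLS row
    top = np.argsort(cls_to_patch)[::-1][:k]       # most-attended patch indices
    coords = np.stack([top // grid_size, top % grid_size], axis=1)
    return top, coords

# Toy example: a 2x2 patch grid (N=4), one head, with [CLS]
# attending most strongly to patch index 3 (token 4).
attn = np.full((1, 5, 5), 0.1)
attn[0, 0, 4] = 0.9
idx, coords = select_significant_patches(attn, grid_size=2, k=1)
# idx[0] == 3 and coords[0] == (1, 1): the bottom-right patch.
```

In a real model these weights would come from the transformer's own attention maps, which is why the abstract can claim the modules add no extra learned localization machinery.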