
Fine-Grained Visual Classification via Internal Ensemble Learning Transformer

Bin Luo, Bo Jiang, Jiahui Wang, Qin Xu
Abstract

Recently, vision transformers (ViTs) have been investigated in fine-grained visual recognition (FGVC) and are now considered state of the art. However, most ViT-based works ignore the different learning performances of the heads in the multi-head self-attention (MHSA) mechanism and its layers. To address these issues, in this paper, we propose a novel internal ensemble learning transformer (IELT) for FGVC. The proposed IELT involves three main modules: a multi-head voting (MHV) module, a cross-layer refinement (CLR) module, and a dynamic selection (DS) module. To solve the problem of the inconsistent performances of multiple heads, we propose the MHV module, which considers all of the heads in each layer as weak learners and votes for tokens of discriminative regions as the cross-layer feature based on the attention maps and spatial relationships. To effectively mine the cross-layer feature and suppress the noise, the CLR module is proposed, where the refined feature is extracted and the assist logits operation is developed for the final prediction. In addition, a newly designed DS module adjusts the token selection number at each layer by weighting each layer's contribution to the refined feature. In this way, the idea of ensemble learning is combined with the ViT to improve fine-grained feature representation. The experiments demonstrate that our method achieves competitive results compared with the state of the art on five popular FGVC datasets. Source code has been released and can be found at https://github.com/mobulan/IELT.
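
The head-voting idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it is a simplified, pure-Python illustration under stated assumptions: each head is treated as a weak learner that votes for the patch tokens its class-token attention scores highest, and the tokens with the most votes are kept. The function name `multi_head_vote` and the parameters `votes_per_head` and `num_selected` are hypothetical, and the spatial-relationship step mentioned in the abstract is omitted.

```python
from collections import Counter


def multi_head_vote(cls_attention, votes_per_head, num_selected):
    """Select tokens by majority vote across attention heads.

    cls_attention: list of heads; each head is a list of attention
        scores from the class token to every patch token (assumed input).
    votes_per_head: how many tokens each head may vote for (hypothetical).
    num_selected: how many tokens to keep for the layer (hypothetical).
    """
    tally = Counter()
    for head_scores in cls_attention:
        # Each head (weak learner) votes for its highest-scoring tokens.
        ranked = sorted(
            range(len(head_scores)),
            key=lambda i: head_scores[i],
            reverse=True,
        )
        for token_index in ranked[:votes_per_head]:
            tally[token_index] += 1
    # Keep the tokens with the most votes; ties go to the lower index.
    by_votes = sorted(tally, key=lambda i: (-tally[i], i))
    return by_votes[:num_selected]


# Toy example: 3 heads, 4 patch tokens.
attn = [
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.50, 0.40, 0.05],
    [0.70, 0.10, 0.15, 0.05],
]
selected = multi_head_vote(attn, votes_per_head=2, num_selected=2)
print(selected)  # [2, 1]
```

In the toy input, the third head disagrees with the other two about the best token, yet token 2 still collects a vote from every head, so the ensemble ranks it first. This is the sense in which voting can smooth over inconsistent individual heads.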