Advancing Fine-Grained Classification by Structure and Subject Preserving Augmentation

Fine-grained visual classification (FGVC) involves classifying closely related sub-classes. This task is difficult due to the subtle differences between classes and the high intra-class variance. Moreover, FGVC datasets are typically small and challenging to gather, thus highlighting a significant need for effective data augmentation. Recent advancements in text-to-image diffusion models offer new possibilities for augmenting classification datasets. While these models have been used to generate training data for classification tasks, their effectiveness in full-dataset training of FGVC models remains under-explored. Recent techniques that rely on Text2Image generation or Img2Img methods often struggle to generate images that accurately represent the class while modifying them to a degree that significantly increases the dataset's diversity. To address these challenges, we present SaSPA: Structure and Subject Preserving Augmentation. Contrary to recent methods, our method does not use real images as guidance, thereby increasing generation flexibility and promoting greater diversity. To ensure accurate class representation, we employ conditioning mechanisms, specifically conditioning on image edges and subject representation. We conduct extensive experiments and benchmark SaSPA against both traditional and recent generative data augmentation methods. SaSPA consistently outperforms all established baselines across multiple settings, including full dataset training, contextual bias, and few-shot classification. Additionally, our results reveal interesting patterns in using synthetic data for FGVC models; for instance, we find a relationship between the amount of real data used and the optimal proportion of synthetic data. Code is available at https://github.com/EyalMichaeli/SaSPA-Aug.
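As an illustration of the structure-preserving condition mentioned above, the sketch below extracts a simple gradient-magnitude edge map of the kind a diffusion model can be conditioned on. This is a minimal NumPy stand-in, not the paper's actual pipeline; the function name, threshold, and gradient operator are assumptions made for illustration only.

```python
import numpy as np

def edge_map(img: np.ndarray, threshold: float = 0.2) -> np.ndarray:
    """Illustrative edge map from a grayscale image in [0, 1].

    Sketch of a structural conditioning signal: central-difference
    gradients, normalized magnitude, then a binary threshold.
    (Hypothetical helper; SaSPA's real edge extractor may differ.)
    """
    gx = np.zeros_like(img, dtype=float)
    gy = np.zeros_like(img, dtype=float)
    # Central differences along x and y (borders left at zero).
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)
    if mag.max() > 0:
        mag = mag / mag.max()  # normalize to [0, 1]
    return (mag > threshold).astype(np.uint8)

# Example: a vertical step edge yields responses at the step columns.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
edges = edge_map(img)
```

In a full augmentation pipeline, such an edge map would be paired with a class-specific text prompt and fed to an edge-conditioned generator, so the generated image varies in texture and background while the object's structure is retained.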