HyperAIHyperAI

Command Palette

Search for a command to run...

3 years ago

Saliency for Fine-grained Object Recognition in Domains with Scarce Training Data

Carola Figueroa Flores Abel Gonzalez-Garcia Joost van de Weijer Bogdan Raducanu

Flower Recognition CNN with Keras

20 Hours of RTX 5090 Compute Resources for Only $1 (Worth $7)
Go to Notebook

Abstract

This paper investigates the role of saliency to improve the classification accuracy of a Convolutional Neural Network (CNN) for the case when scarce training data is available. Our approach consists in adding a saliency branch to an existing CNN architecture which is used to modulate the standard bottom-up visual features from the original image input, acting as an attentional mechanism that guides the feature extraction process. The main aim of the proposed approach is to enable the effective training of a fine-grained recognition model with limited training samples and to improve the performance on the task, thereby alleviating the need to annotate a large dataset. The vast majority of saliency methods are evaluated on their ability to generate saliency maps, and not on their functionality in a complete vision pipeline. Our proposed pipeline allows to evaluate saliency methods for the high-level task of object recognition. We perform extensive experiments on various fine-grained datasets (Flowers, Birds, Cars, and Dogs) under different conditions and show that saliency can considerably improve the network's performance, especially for the case of scarce training data.

One-sentence Summary

By integrating a saliency branch that modulates bottom-up visual features as an attentional mechanism, the proposed CNN considerably improves fine-grained object recognition accuracy on the Flowers, Birds, Cars, and Dogs datasets, particularly when training samples are scarce, thereby validating saliency methods within complete vision pipelines rather than restricting evaluation to map generation alone.

Key Contributions

  • Introduces a convolutional neural network architecture that integrates a dedicated saliency branch to modulate standard bottom-up visual features as an attentional mechanism.
  • Establishes a complete vision pipeline that evaluates saliency generation methods by measuring their direct impact on high-level object recognition performance rather than relying solely on traditional saliency map quality metrics.
  • Demonstrates through extensive experiments on the Flowers, Birds, Cars, and Dogs datasets that the proposed architecture significantly improves classification accuracy, particularly under limited training data conditions.

Introduction

Fine-grained object recognition requires distinguishing highly similar subclasses, a task that traditionally demands expensive expert annotations and large labeled datasets to capture subtle visual differences. While computational saliency methods effectively highlight visually prominent regions, prior work primarily optimizes these models for map accuracy or human gaze prediction rather than measuring their actual impact on downstream classification. Additionally, existing attention-based neural networks typically require learning new parameters from scratch, making them unstable and prone to overfitting when labeled examples are scarce. The authors leverage a pretrained saliency network as a fixed attention module that modulates standard visual features within a dual-branch architecture. By guiding the recognition model to focus on discriminative regions without requiring explicit part annotations, this approach significantly boosts classification accuracy under data-scarce conditions and reduces the need for costly dataset collection.

Dataset

  • Dataset Composition and Sources: The authors evaluate their framework on four standard fine-grained classification benchmarks sourced from established academic repositories.
  • Subset Specifications:
    • Oxford Flower 102 provides 8,189 images across 102 classes, with 40 to 258 samples per category.
    • The Birds dataset contains 11,788 images spanning 200 species, originally equipped with bounding boxes and 15 keypoints, though the authors process the full uncropped images.
    • The Cars dataset offers 16,185 images across 196 classes, already partitioned into roughly equal training and testing portions.
    • Stanford Dogs includes 20,580 images across 120 breeds, with a preprocessing step that removes any images overlapping with ImageNet.
  • Training Protocol and Data Utilization: For each class, the authors enforce a fixed split of five test images, five validation images, and the remainder for training. To measure performance under limited data conditions, they train models on subsets of kkk images per class, where kkk takes values from 1 to 30 and includes the complete available training set. The base AlexNet architecture is pretrained on ImageNet and fine-tuned for 70 epochs using a learning rate of 0.01 and weight decay of 0.003. The authors also validate the pipeline with ResNet-50 and ResNet-152, reporting classification accuracy averaged over five independent random initializations.
  • Processing and Input Engineering: The workflow avoids explicit cropping, operating directly on full images. Although the Birds dataset includes bounding box and keypoint metadata, these annotations are intentionally ignored. Instead, the authors construct an additional saliency input channel by generating attention maps through five established algorithms (iSEEL, SALICON, Itti and Koch, GBVS, and BMS) and two geometric baselines (uniform white and centered Gaussian distributions) to determine whether learned visual attention enhances recognition beyond standard pixel features.

Method

The authors leverage a dual-branch architecture to integrate saliency information into a convolutional neural network (CNN) for fine-grained object classification under conditions of scarce training data. The framework consists of two primary pathways: an RGB branch that processes the original color image and a saliency branch that operates on a precomputed saliency map derived from the same image. These two streams are designed to interact through a modulation mechanism that dynamically adjusts the importance of visual features during feature extraction. The RGB branch follows a standard CNN processing pipeline, while the saliency branch transforms the input saliency map into a modulation image of matching spatial dimensions. This modulation image is then used to scale the feature maps of an intermediate layer in the RGB branch, effectively emphasizing salient regions and de-emphasizing less relevant background areas. The modulated features are subsequently combined with the original features via a skip connection and fed into a shared joint branch, which continues processing through additional layers before reaching the final classification layer. The architecture is designed to be modular and compatible with various base networks, such as AlexNet, ResNet-50, and ResNet-152, and is trained end-to-end to jointly optimize both the classification and modulation components.


Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding
Ready-to-use GPUs
Best Pricing

HyperAI Newsletters

Subscribe to our latest updates
We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning
Powered by MailChimp