3 years ago

Saliency for fine-grained object recognition in domains with scarce training data

Carola Figueroa Flores, Abel Gonzalez-Garcia, Joost van de Weijer, Bogdan Raducanu

Abstract

This paper investigates the role of saliency in improving the classification accuracy of a convolutional neural network (CNN) when training data is scarce. Our approach adds a saliency branch to an existing CNN architecture. This branch modulates the standard bottom-up visual features computed from the original image input, acting as an attentional mechanism that guides the feature extraction process. The main aim of the proposed approach is to enable effective training of a fine-grained recognition model with a limited number of training samples and to improve performance on the task, thereby alleviating the need to annotate a large dataset. The vast majority of saliency methods are evaluated on their ability to generate saliency maps, not on their usefulness within a complete vision pipeline. Our proposed pipeline makes it possible to evaluate saliency methods on the high-level task of object recognition. We perform extensive experiments on several fine-grained datasets (Flowers, Birds, Cars, and Dogs) under different conditions and show that saliency can considerably improve network performance, especially when training data is scarce.

One-sentence Summary

By integrating a saliency branch that modulates bottom-up visual features as an attentional mechanism, the proposed CNN considerably improves fine-grained object recognition accuracy on the Flowers, Birds, Cars, and Dogs datasets, particularly when training samples are scarce, thereby validating saliency methods within complete vision pipelines rather than restricting evaluation to map generation alone.

Key Contributions

  • Introduces a convolutional neural network architecture that integrates a dedicated saliency branch to modulate standard bottom-up visual features as an attentional mechanism.
  • Establishes a complete vision pipeline that evaluates saliency generation methods by measuring their direct impact on high-level object recognition performance rather than relying solely on traditional saliency map quality metrics.
  • Demonstrates through extensive experiments on the Flowers, Birds, Cars, and Dogs datasets that the proposed architecture significantly improves classification accuracy, particularly under limited training data conditions.

Introduction

Fine-grained object recognition requires distinguishing highly similar subclasses, a task that traditionally demands expensive expert annotations and large labeled datasets to capture subtle visual differences. While computational saliency methods effectively highlight visually prominent regions, prior work primarily optimizes these models for map accuracy or human gaze prediction rather than measuring their actual impact on downstream classification. Additionally, existing attention-based neural networks typically require learning new parameters from scratch, making them unstable and prone to overfitting when labeled examples are scarce. The authors leverage a pretrained saliency network as a fixed attention module that modulates standard visual features within a dual-branch architecture. By guiding the recognition model to focus on discriminative regions without requiring explicit part annotations, this approach significantly boosts classification accuracy under data-scarce conditions and reduces the need for costly dataset collection.

Dataset

  • Dataset Composition and Sources: The authors evaluate their framework on four standard fine-grained classification benchmarks sourced from established academic repositories.
  • Subset Specifications:
    • Oxford Flower 102 provides 8,189 images across 102 classes, with 40 to 258 samples per category.
    • The Birds dataset contains 11,788 images spanning 200 species, originally equipped with bounding boxes and 15 keypoints, though the authors process the full uncropped images.
    • The Cars dataset offers 16,185 images across 196 classes, already partitioned into roughly equal training and testing portions.
    • Stanford Dogs includes 20,580 images across 120 breeds, with a preprocessing step that removes any images overlapping with ImageNet.
  • Training Protocol and Data Utilization: For each class, the authors hold out five test images and five validation images and use the remainder for training. To measure performance under limited data, they train models on subsets of k images per class, where k takes values from 1 to 30, and also on the complete available training set. The base AlexNet architecture is pretrained on ImageNet and fine-tuned for 70 epochs using a learning rate of 0.01 and weight decay of 0.003. The authors also validate the pipeline with ResNet-50 and ResNet-152, reporting classification accuracy averaged over five independent random initializations.
  • Processing and Input Engineering: The workflow avoids explicit cropping, operating directly on full images. Although the Birds dataset includes bounding box and keypoint metadata, these annotations are intentionally ignored. Instead, the authors construct an additional saliency input channel by generating attention maps through five established algorithms (iSEEL, SALICON, Itti and Koch, GBVS, and BMS) and two geometric baselines (uniform white and centered Gaussian distributions) to determine whether learned visual attention enhances recognition beyond standard pixel features.
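
The per-class split and the k-images-per-class subsets described in the training protocol above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `split_per_class` and `k_shot_subset`, the use of a seeded random shuffle to pick the held-out images, and the convention that `k=None` means "full training set" are all assumptions made for the example.

```python
import random
from collections import defaultdict


def split_per_class(labels, n_test=5, n_val=5, seed=0):
    """Return (train, val, test) lists of sample indices.

    For every class, n_test indices go to test, n_val to validation,
    and the remainder to training. Random selection is an assumption;
    the source only states the split sizes.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        test.extend(indices[:n_test])
        val.extend(indices[n_test:n_test + n_val])
        train.extend(indices[n_test + n_val:])
    return train, val, test


def k_shot_subset(train_indices, labels, k, seed=0):
    """Keep at most k training indices per class; k=None keeps all of them."""
    if k is None:
        return list(train_indices)
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in train_indices:
        by_class[labels[idx]].append(idx)

    subset = []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        subset.extend(indices[:k])
    return subset


if __name__ == "__main__":
    # Toy labels: 3 classes with 40 images each.
    labels = [c for c in range(3) for _ in range(40)]
    train, val, test = split_per_class(labels)
    few = k_shot_subset(train, labels, k=5)
    print(len(train), len(val), len(test), len(few))  # prints: 90 15 15 15
```

Holding the test and validation indices fixed while varying only `k` keeps results comparable across training-set sizes, since every configuration is scored on the same held-out images.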

Method

The authors leverage a dual-branch architecture to integrate saliency information into a convolutional neural network (CNN) for fine-grained object classification under conditions of scarce training data. The framework consists of two primary pathways: an RGB branch that processes the original color image and a saliency branch that operates on a precomputed saliency map derived from the same image. These two streams are designed to interact through a modulation mechanism that dynamically adjusts the importance of visual features during feature extraction. The RGB branch follows a standard CNN processing pipeline, while the saliency branch transforms the input saliency map into a modulation image of matching spatial dimensions. This modulation image is then used to scale the feature maps of an intermediate layer in the RGB branch, effectively emphasizing salient regions and de-emphasizing less relevant background areas. The modulated features are subsequently combined with the original features via a skip connection and fed into a shared joint branch, which continues processing through additional layers before reaching the final classification layer. The architecture is designed to be modular and compatible with various base networks, such as AlexNet, ResNet-50, and ResNet-152, and is trained end-to-end to jointly optimize both the classification and modulation components.
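
The modulation step with its skip connection can be illustrated with a small NumPy sketch. Several simplifications are assumed here and are not taken from the paper: average pooling stands in for the learned saliency branch, a centered Gaussian (one of the geometric baselines mentioned in the Dataset section) stands in for a real saliency map, and the function names and the `sigma_frac` parameter are hypothetical.

```python
import numpy as np


def centered_gaussian_map(height, width, sigma_frac=0.25):
    """Centered 2D Gaussian map with values in (0, 1].

    sigma_frac is a hypothetical parameter: the standard deviation
    expressed as a fraction of each image dimension.
    """
    ys = np.arange(height) - (height - 1) / 2.0
    xs = np.arange(width) - (width - 1) / 2.0
    gy = np.exp(-0.5 * (ys / (sigma_frac * height)) ** 2)
    gx = np.exp(-0.5 * (xs / (sigma_frac * width)) ** 2)
    return np.outer(gy, gx)


def uniform_white_map(height, width):
    """Uniform baseline: every location is treated as equally salient."""
    return np.ones((height, width))


def downsample_to(saliency, out_h, out_w):
    """Average-pool a (H, W) map down to (out_h, out_w).

    Stand-in for the learned saliency branch, which in the described
    architecture produces a modulation image matching the feature maps.
    """
    h, w = saliency.shape
    if h % out_h or w % out_w:
        raise ValueError("input size must be a multiple of the output size")
    blocks = saliency.reshape(out_h, h // out_h, out_w, w // out_w)
    return blocks.mean(axis=(1, 3))


def modulate(features, modulation):
    """Scale (C, H, W) features by a (H, W) map, then add the skip connection."""
    return features * modulation[None, :, :] + features


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.standard_normal((4, 8, 8))  # toy intermediate feature maps
    saliency = centered_gaussian_map(64, 64)   # toy saliency input
    modulation = downsample_to(saliency, 8, 8)
    out = modulate(features, modulation)
    print(out.shape)
```

Because of the skip connection, a location with zero saliency still passes its original features through unchanged, so the modulation can only emphasize regions and never fully suppresses them. With the uniform white map, every feature is simply doubled, which is why it serves as a baseline that carries no spatial information.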

