3 years ago

AeroGen: Enhancing Remote Sensing Object Detection with Diffusion-Driven Data Generation

Datao Tang Xiangyong Cao Xuan Wu Jialin Li Jing Yao Xueru Bai Dongsheng Jiang Yin Li Deyu Meng

Data Analysis, Data Augmentation and ResNet Neural Networks

Abstract

Remote sensing image object detection (RSIOD) aims to identify and locate specific objects within satellite or aerial imagery. However, there is a scarcity of labeled data in current RSIOD datasets, which significantly limits the performance of current detection algorithms. Although existing techniques, e.g., data augmentation and semi-supervised learning, can mitigate this scarcity issue to some extent, they are heavily dependent on high-quality labeled data and perform worse in rare object classes. To address this issue, this paper proposes a layout-controllable diffusion generative model (i.e., AeroGen) tailored for RSIOD. To our knowledge, AeroGen is the first model to simultaneously support horizontal and rotated bounding box condition generation, thus enabling the generation of high-quality synthetic images that meet specific layout and object category requirements. Additionally, we propose an end-to-end data augmentation framework that integrates a diversity-conditioned generator and a filtering mechanism to enhance both the diversity and quality of generated data. Experimental results demonstrate that the synthetic data produced by our method are of high quality and diversity. Furthermore, the synthetic RSIOD data can significantly improve the detection performance of existing RSIOD models, i.e., the mAP metrics on DIOR, DIOR-R, and HRSC datasets are improved by 3.7%, 4.3%, and 2.43%, respectively.

One-sentence Summary

The authors propose AeroGen, a layout-controllable diffusion generative model that generates high-quality synthetic training data for remote sensing object detection by simultaneously conditioning on horizontal and rotated bounding boxes, which improves mAP on the DIOR, DIOR-R, and HRSC datasets by 3.7%, 4.3%, and 2.43%, respectively.

Key Contributions

  • The paper introduces AeroGen, a layout-controllable diffusion generative model tailored for remote sensing image object detection that simultaneously supports horizontal and rotated bounding box conditioning to synthesize high-quality images with specific spatial layouts.
  • A diversity-conditioned generator is combined with a targeted filtering mechanism to optimize the variety and fidelity of synthetic data, enabling efficient end-to-end data augmentation without relying on instance-pasting pipelines.
  • Benchmark evaluations on the DIOR, DIOR-R, and HRSC datasets demonstrate that training detection models with this synthetic data improves mean average precision by 3.7%, 4.3%, and 2.43%, respectively.

Introduction

Remote sensing image object detection enables critical analysis of satellite and aerial imagery, but its advancement is consistently hindered by a severe shortage of high-quality labeled training data. Existing generative and augmentation techniques often rely heavily on abundant real annotations, perform poorly on rare object categories, and lack the precise spatial control required for the rotated and horizontal bounding boxes typical in aerial scenes. The authors leverage a layout-controllable diffusion model called AeroGen to directly synthesize high-fidelity remote sensing images conditioned on specific object layouts. By integrating a diversity-conditioned generator with a quality-aware filtering mechanism, their end-to-end framework overcomes prior limitations and delivers synthetic training data that substantially boosts detection accuracy across standard benchmarks.

Dataset

  • Dataset Composition and Sources: The authors use three remote sensing datasets: DIOR, DIOR-R, and HRSC. DIOR and DIOR-R share identical imagery but differ in annotation formats, with DIOR utilizing standard bounding boxes and DIOR-R employing rotated bounding boxes. HRSC serves as a dedicated ship detection dataset.
  • Subset Details and Splits: HRSC comprises 436 training, 181 evaluation, and 444 test frames with resolutions spanning 300x300 to 1500x900 pixels. The DIOR and DIOR-R collections are partitioned into training, validation, and testing sets at a 1:1:2 ratio. All generative training relies exclusively on the training splits.
  • Processing and Filtering: The authors generate synthetic data by fitting a conditional diffusion model to expand layout conditions. They then apply two automated filters to remove low-quality synthetic conditions and images, enforcing strict semantic and layout consistency before integration. Cropping strategies and explicit metadata construction steps are not detailed in the provided text.
  • Usage and Training Configuration: The filtered synthetic images are combined with real data to augment the training set for downstream object detection. The authors train the AeroGen model separately on each dataset for 100 epochs using an AdamW optimizer at a 1e-5 learning rate. Only the UNet attention layers and Layout Mask Attention modules are updated, while the remaining weights stay frozen from a pretrained remote sensing diffusion checkpoint.
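The selective fine-tuning described in the last bullet can be sketched as a name-based split of model parameters. This is a minimal illustration only: the keyword fragments and the parameter names below are hypothetical, and a real codebase would apply the same idea to its own module names before passing the trainable subset to the optimizer.

```python
# Hypothetical name fragments; real module names depend on the codebase.
TRAINABLE_KEYWORDS = ("attn", "layout_mask_attention")


def split_parameters(param_names, keywords=TRAINABLE_KEYWORDS):
    """Partition parameter names into (trainable, frozen) by substring match."""
    trainable, frozen = [], []
    for name in param_names:
        (trainable if any(k in name for k in keywords) else frozen).append(name)
    return trainable, frozen


# Hyperparameters stated in the text above.
train_config = {"optimizer": "AdamW", "learning_rate": 1e-5, "epochs": 100}

# Made-up parameter names, for illustration only.
names = [
    "unet.down.0.resnet.conv1.weight",
    "unet.down.0.attn.to_q.weight",
    "unet.mid.layout_mask_attention.proj.weight",
    "vae.decoder.conv_out.weight",
]
trainable, frozen = split_parameters(names)
print(trainable)
print(frozen)
```

Only the names in `trainable` would receive gradient updates; everything in `frozen` keeps the weights of the pretrained checkpoint.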

Method

The authors leverage a two-component framework for generating high-quality remote sensing images conditioned on layout constraints. The primary component is a layout-conditional diffusion model, which integrates both global text guidance and precise layout control to generate images with specified object placements. This model is built upon a fine-tuned latent diffusion model (LDM) adapted for remote sensing tasks. The layout control is achieved through a dual cross-attention mechanism that fuses global text conditions with localized layout information. The global text prompt is processed by a frozen CLIP text encoder to produce semantic embeddings, which serve as the global conditioning signal. Concurrently, layout information is encoded using a combination of Fourier encoding and category-specific embeddings. Each object's bounding box, whether axis-aligned or rotated, is represented as a list of eight coordinates, which are then Fourier encoded to convert positional data into a frequency-domain vector. This encoded positional representation is concatenated with the category embedding obtained from the CLIP encoder and passed through a linear layer to generate layout control tokens. These tokens are injected into the diffusion process via a dual cross-attention module, where they modulate the attention mechanism to guide the generation process. The output of the model is a weighted sum of the global and layout-conditioned attention outputs, allowing the model to balance both high-level semantic guidance and precise spatial layout.
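To make the box representation concrete, the following sketch converts horizontal and rotated boxes into the shared eight-coordinate form and applies a Fourier encoding. The frequency schedule (powers of two times pi), the number of frequencies, the corner ordering, and all function names are assumptions for illustration; the summary does not specify them.

```python
import math


def hbb_to_corners(x_min, y_min, x_max, y_max):
    """Axis-aligned box -> eight coordinates (four corners, starting top-left)."""
    return [x_min, y_min, x_max, y_min, x_max, y_max, x_min, y_max]


def obb_to_corners(cx, cy, w, h, angle_rad):
    """Rotated box (center, size, angle) -> eight coordinates in the same order."""
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    corners = []
    for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)):
        corners.append(cx + dx * cos_a - dy * sin_a)
        corners.append(cy + dx * sin_a + dy * cos_a)
    return corners


def fourier_encode(coords, num_freqs=8):
    """Map each coordinate c to sin(2^k * pi * c), cos(2^k * pi * c) for k < num_freqs."""
    features = []
    for c in coords:
        for k in range(num_freqs):
            features.append(math.sin((2 ** k) * math.pi * c))
            features.append(math.cos((2 ** k) * math.pi * c))
    return features


box = hbb_to_corners(0.1, 0.2, 0.5, 0.6)
encoded = fourier_encode(box)
print(len(box), len(encoded))  # 8 coordinates -> 8 * 8 * 2 = 128 features
```

Because both box types reduce to the same eight numbers, a single encoder can serve horizontal and rotated conditions; in the described model the encoded vector would then be concatenated with the category embedding and projected by a linear layer.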

As shown in the figure below, the layout embedding module combines bounding box coordinates with vectorized semantic information using Fourier and MLP layers. This encodes layout information to facilitate control, while the prompt description is processed by a CLIP text encoder for global conditional guidance. The figure also demonstrates the injection of layout information at the noise level, where a local mask governs the injection position, allowing for finer layout control. Finally, it illustrates the overall architecture and training process of AeroGen: at each timestep, the image being denoised first passes through a layout information injection module, which enhances layout conditional guidance. The architecture integrates a residual block and self-attention layers, with layout control applied through a layout mask attention mechanism that uses a binary mask to guide the attention computation, enabling precise manipulation of local noise characteristics during the diffusion generation process.
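One way to picture the binary mask is to rasterize each box, given as eight normalized coordinates, onto the latent grid. The sketch below does this with a convex point-in-polygon test on cell centers. It only illustrates mask construction under that assumption; how AeroGen builds its mask and feeds it into the attention computation is not detailed here, and the function name is hypothetical.

```python
def polygon_mask(corners, height, width):
    """Rasterize a convex quadrilateral (eight normalized coordinates) into a binary grid."""
    pts = [(corners[i], corners[i + 1]) for i in range(0, 8, 2)]
    edges = list(zip(pts, pts[1:] + pts[:1]))
    mask = [[0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            # Test the cell center, expressed in normalized [0, 1] coordinates.
            px, py = (col + 0.5) / width, (row + 0.5) / height
            crosses = [
                (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)
                for (x1, y1), (x2, y2) in edges
            ]
            # Inside a convex polygon, all edge cross products share one sign.
            if all(c >= 0 for c in crosses) or all(c <= 0 for c in crosses):
                mask[row][col] = 1
    return mask


# Axis-aligned box covering the central half of a 4x4 grid.
for line in polygon_mask([0.25, 0.25, 0.75, 0.25, 0.75, 0.75, 0.25, 0.75], 4, 4):
    print(line)
```

The same function handles rotated boxes, since a rotated rectangle is still a convex quadrilateral.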

The second component of the framework is a generative pipeline that produces diverse and high-quality synthetic data by combining a diffusion-based generator with a data filtering mechanism. This pipeline operates in five stages: label generation, label filtering, image generation, image filtering, and data augmentation. In the label generation stage, a denoising diffusion probabilistic model (DDPM) is used to learn the conditional distribution of layout labels, which are represented as a matrix with dimensions $H \times W \times N$, where $H$ and $W$ are the image dimensions and $N$ is the number of object categories. Each element in the matrix is set to 1 if the pixel belongs to a target region of a specific category and -1 otherwise. The DDPM generator samples from this distribution to produce synthetic layout labels. These labels are then passed through a filtering mechanism based on Gaussian distributions, which ensures that the generated bounding box attributes, such as area, conform to realistic distributions by applying a threshold based on the standard deviation. This filtering step helps to exclude implausible or low-quality layout conditions.
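The label encoding and the Gaussian filter can be sketched as follows. This is an illustration under stated assumptions: boxes are axis-aligned and given in pixel units, the filter is applied to box area only, and the cutoff of two standard deviations (`num_std`) is a hypothetical value, not one reported by the authors.

```python
import statistics


def layout_to_label_matrix(boxes, height, width, num_classes):
    """Encode boxes as an H x W x N grid: 1 inside a class's box, -1 elsewhere."""
    labels = [[[-1] * num_classes for _ in range(width)] for _ in range(height)]
    for class_id, x_min, y_min, x_max, y_max in boxes:
        for row in range(y_min, y_max):
            for col in range(x_min, x_max):
                labels[row][col][class_id] = 1
    return labels


def gaussian_filter(real_values, candidates, num_std=2.0):
    """Keep candidates within num_std standard deviations of the real-data mean."""
    mean = statistics.mean(real_values)
    std = statistics.stdev(real_values)
    return [c for c in candidates if abs(c - mean) <= num_std * std]


# Toy box areas: real training boxes versus sampled synthetic boxes.
real_areas = [100, 120, 110, 90, 105, 95, 115]
synthetic_areas = [108, 300, 97, 5]
print(gaussian_filter(real_areas, synthetic_areas))  # [108, 97]
```

Here the real areas have mean 105 and sample standard deviation of about 10.8, so the synthetic areas 300 and 5 fall outside the accepted band and are discarded.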

As shown in the figure below, the generative pipeline begins with label generation, where a denoising diffusion model samples synthetic labels. These labels are then filtered using a Gaussian distribution-based mechanism to ensure they are realistic. The filtered labels are augmented and used to guide the image generation process, where the layout-guided diffusion model produces synthetic images. The generated images undergo a quality assessment based on both semantic and layout consistency. Semantic consistency is evaluated using the CLIP model, while layout consistency is assessed using a ResNet101-based classifier. Images that meet predefined thresholds for both quality and consistency are selected for the final dataset. The pipeline concludes with data augmentation, where the synthetic images are combined with real images to train downstream object detection models. This process ensures that the synthetic data is both diverse and semantically consistent, enhancing the overall performance of the downstream detection models.
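The image filtering stage amounts to a two-threshold gate over per-image scores. The sketch below assumes both scores have already been computed elsewhere (for example, a CLIP image-text similarity and a classifier confidence over the annotated regions); the class name, field names, and threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SyntheticSample:
    image_id: str
    semantic_score: float  # assumed: CLIP-based consistency with the prompt
    layout_score: float    # assumed: classifier-based consistency with the boxes


def filter_samples(samples, semantic_threshold=0.25, layout_threshold=0.8):
    """Keep samples whose semantic and layout scores both reach their thresholds."""
    return [
        s for s in samples
        if s.semantic_score >= semantic_threshold and s.layout_score >= layout_threshold
    ]


candidates = [
    SyntheticSample("img_000", semantic_score=0.31, layout_score=0.92),
    SyntheticSample("img_001", semantic_score=0.12, layout_score=0.95),  # off-prompt
    SyntheticSample("img_002", semantic_score=0.29, layout_score=0.40),  # wrong layout
]
kept = filter_samples(candidates)
print([s.image_id for s in kept])  # ['img_000']
```

Requiring both conditions means an image that matches the prompt but misplaces objects is rejected just like one that follows the layout but drifts from the described scene.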

Experiment

The evaluation assessed AeroGen’s generative capabilities and its effectiveness as a data augmentation tool for downstream remote sensing object detection tasks across multiple benchmark datasets. Comparative analyses demonstrate that the model consistently produces high-quality imagery with superior layout consistency and enhanced small object rendering, while successfully accommodating rotated bounding boxes. Furthermore, experiments confirm that integrating these synthetic images significantly boosts downstream detection performance, particularly for underrepresented categories, and outperforms traditional augmentation strategies. Ablation studies further validate that specific architectural components and pipeline filtering mechanisms work synergistically to optimize generation quality, establishing AeroGen as a robust solution for enhancing remote sensing vision tasks.

The authors compare different data augmentation strategies, including traditional methods and their proposed AeroGen-based approach, on a downstream object detection task. Results show that combining AeroGen with traditional augmentation techniques leads to the highest performance improvements across metrics.

  • AeroGen combined with traditional augmentation methods outperforms individual strategies.
  • The integration of AeroGen with Flip and CopyPaste achieves the best results on both mAP and mAP50 metrics.
  • Traditional augmentation methods alone show lower performance compared to the proposed method with synthetic data.

The authors conducted experiments to evaluate the effectiveness of synthetic data generated by AeroGen for improving downstream object detection tasks. Results show that adding synthetic data consistently enhances performance across different datasets, with improvements becoming more pronounced as the amount of synthetic data increases. The benefits are especially notable for rare categories, where the gains are substantial.

  • Adding synthetic data significantly improves performance on downstream object detection tasks.
  • Performance increases with the amount of synthetic data, showing consistent improvements across metrics.
  • The benefits of synthetic data are most pronounced for rare categories, leading to substantial gains in detection performance.

The authors evaluate the performance of AeroGen in generating images from layout conditions on multiple datasets and modalities, comparing it with state-of-the-art methods. Results show that AeroGen achieves superior performance across all metrics, particularly in handling rotated bounding boxes and generating high-quality images that enhance downstream object detection tasks. The effectiveness of synthetic data generated by AeroGen is further validated through data augmentation experiments, where it consistently improves detection performance, especially in rare categories.

  • AeroGen outperforms existing layout-to-image generation methods across multiple metrics and datasets, including those with rotated bounding boxes.
  • Synthetic data generated by AeroGen significantly enhances downstream object detection performance, particularly for rare categories.
  • The integration of specific modules like Layout Mask Attention and Dual Cross Attention improves image quality and layout consistency in generated outputs.

The authors evaluate the impact of synthetic data generated by AeroGen on downstream object detection tasks using the DIOR-R dataset. Results show that increasing the amount of synthetic data leads to improvements in detection performance across both mAP and mAP50 metrics, with the highest gains observed at larger data scales. The addition of synthetic data consistently enhances model performance compared to baseline conditions without augmentation.

  • Increasing the amount of synthetic data improves detection performance on the DIOR-R dataset.
  • The best performance is achieved with the largest scale of synthetic data, showing significant gains in both mAP and mAP50.
  • Synthetic data augmentation consistently outperforms the baseline without augmentation across evaluation metrics.

The authors conducted experiments to evaluate the generative capabilities of AeroGen, focusing on layout-to-image generation and its effectiveness in data augmentation for downstream object detection tasks. Results show that AeroGen outperforms existing methods in generating high-quality images consistent with input layouts, and the synthetic data significantly improves detection performance, particularly for rare categories. The model's effectiveness is further validated through ablation studies that highlight the contributions of key modules and pipeline components.

  • AeroGen outperforms state-of-the-art methods in generating images consistent with layout conditions across multiple datasets.
  • Synthetic data generated by AeroGen significantly improves downstream object detection performance, especially for rare categories.
  • Ablation studies confirm that both layout mask attention and dual cross attention modules enhance image quality and detection performance.

The evaluation setup tests AeroGen’s layout-to-image generation capabilities and its effectiveness as a data augmentation strategy for downstream object detection across multiple datasets. Comparative benchmarks validate that the model produces higher quality, layout-consistent images than existing methods, particularly for rotated bounding boxes. Augmentation experiments confirm that integrating synthetic data with traditional techniques yields the strongest performance gains, which scale positively with data volume and are especially beneficial for rare object categories. Finally, ablation studies validate the critical role of dedicated attention modules in preserving spatial accuracy and boosting overall detection reliability.

