3 years ago

AeroGen: Enhancing Remote Sensing Object Detection with Diffusion-Driven Data Generation

Datao Tang Xiangyong Cao Xuan Wu Jialin Li Jing Yao Xueru Bai Dongsheng Jiang Yin Li Deyu Meng

Data analysis, data augmentation, and ResNet neural networks

Abstract

Title: Synthetic Data Generation for Remote Sensing Image Object Detection Using a Layout-Controllable Diffusion Generative Model (AeroGen)

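Abstract: Remote sensing image object detection (RSIOD) aims to identify and locate specific objects within satellite and aerial imagery. However, current RSIOD datasets suffer from a severe shortage of labeled data, which significantly limits the performance of existing detection algorithms. Existing techniques such as data augmentation and semi-supervised learning can mitigate this shortage to some extent, but they depend heavily on high-quality labeled data and tend to perform worse on rare object classes. To address this issue, this paper proposes a layout-controllable diffusion generative model tailored for RSIOD, namely AeroGen. To the best of the authors' knowledge, AeroGen is the first model to simultaneously support conditional generation from horizontal and rotated bounding boxes, enabling the generation of high-quality synthetic images that meet specific layout and object category requirements. Mechanisms for improving the diversity and quality of the generated data are also introduced. Experimental results show that the synthetic data produced by the method is both high-quality and diverse. Furthermore, the synthetic RSIOD data significantly improves the detection performance of existing RSIOD models, raising the mAP metric on the DIOR, DIOR-R, and HRSC datasets by 3.7%, 4.3%, and 2.43%, respectively.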

One-sentence Summary

The authors propose AeroGen, a layout-controllable diffusion generative model that generates high-quality synthetic training data for remote sensing object detection by simultaneously conditioning on horizontal and rotated bounding boxes, which improves mAP on the DIOR, DIOR-R, and HRSC datasets by 3.7%, 4.3%, and 2.43%, respectively.

Key Contributions

  • The paper introduces AeroGen, a layout-controllable diffusion generative model tailored for remote sensing image object detection that simultaneously supports horizontal and rotated bounding box conditioning to synthesize high-quality images with specific spatial layouts.
  • A diversity-conditioned generator is combined with a targeted filtering mechanism to optimize the variety and fidelity of synthetic data, enabling efficient end-to-end data augmentation without relying on instance-pasting pipelines.
  • Benchmark evaluations on the DIOR, DIOR-R, and HRSC datasets demonstrate that training detection models with this synthetic data improves mean average precision by 3.7%, 4.3%, and 2.43%, respectively.

Introduction

Remote sensing image object detection enables critical analysis of satellite and aerial imagery, but its advancement is consistently hindered by a severe shortage of high-quality labeled training data. Existing generative and augmentation techniques often rely heavily on abundant real annotations, perform poorly on rare object categories, and lack the precise spatial control required for the rotated and horizontal bounding boxes typical in aerial scenes. The authors leverage a layout-controllable diffusion model called AeroGen to directly synthesize high-fidelity remote sensing images conditioned on specific object layouts. By integrating a diversity-conditioned generator with a quality-aware filtering mechanism, their end-to-end framework overcomes prior limitations and delivers synthetic training data that substantially boosts detection accuracy across standard benchmarks.

Dataset

  • Dataset Composition and Sources: The authors use three remote sensing datasets: DIOR, DIOR-R, and HRSC. DIOR and DIOR-R share identical imagery but differ in annotation formats, with DIOR utilizing standard bounding boxes and DIOR-R employing rotated bounding boxes. HRSC serves as a dedicated ship detection dataset.
  • Subset Details and Splits: HRSC comprises 436 training, 181 validation, and 444 test images with resolutions ranging from 300×300 to 1500×900 pixels. The DIOR and DIOR-R collections are partitioned into training, validation, and testing sets at a 1:1:2 ratio. All generative training relies exclusively on the training splits.
  • Processing and Filtering: The authors expand the set of layout conditions by fitting a conditional diffusion model and sampling new layouts from it. They then apply two automated filters, one that removes low-quality synthetic layout conditions and one that removes low-quality synthetic images, enforcing semantic and layout consistency before the data is integrated.
  • Usage and Training Configuration: The filtered synthetic images are combined with real data to augment the training set for downstream object detection. The authors train the AeroGen model separately on each dataset for 100 epochs using an AdamW optimizer at a 1e-5 learning rate. Only the UNet attention layers and Layout Mask Attention modules are updated, while the remaining weights stay frozen from a pretrained remote sensing diffusion checkpoint.
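The selective fine-tuning described in the last bullet can be sketched as a name-based split of model parameters. This is a minimal illustration, not the authors' code: the parameter names, the `unet.` prefix, and the keyword list are all hypothetical, and a real checkpoint would use its own naming scheme. Only the optimizer, learning rate, and epoch count come from the text above.

```python
# Hyperparameters stated in the summary above.
TRAIN_CONFIG = {"optimizer": "AdamW", "learning_rate": 1e-5, "epochs": 100}

# Hypothetical substrings that mark attention and Layout Mask Attention weights.
TRAINABLE_KEYWORDS = ("attn", "layout_mask_attention")


def is_trainable(name, keywords=TRAINABLE_KEYWORDS):
    """Return True for UNet attention / layout-mask-attention parameters only.

    The prefix check matters: a text encoder also contains attention layers
    (e.g. "self_attn"), but it is described as frozen.
    """
    return name.startswith("unet.") and any(key in name for key in keywords)


def split_parameters(names):
    """Partition parameter names into (trainable, frozen) lists, preserving order."""
    trainable = [name for name in names if is_trainable(name)]
    frozen = [name for name in names if not is_trainable(name)]
    return trainable, frozen


# Hypothetical parameter names, for illustration only.
parameter_names = [
    "unet.down.0.resnet.conv1.weight",
    "unet.down.0.attn1.to_q.weight",
    "unet.mid.layout_mask_attention.to_k.weight",
    "vae.decoder.conv_out.weight",
    "text_encoder.layers.0.self_attn.q_proj.weight",
]
trainable, frozen = split_parameters(parameter_names)
```

In a deep learning framework, the `frozen` names would have gradient tracking disabled and only the `trainable` parameters would be handed to the optimizer.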

Method

The authors leverage a two-component framework for generating high-quality remote sensing images conditioned on layout constraints. The primary component is a layout-conditional diffusion model, which integrates both global text guidance and precise layout control to generate images with specified object placements. This model is built upon a fine-tuned latent diffusion model (LDM) adapted for remote sensing tasks. The layout control is achieved through a dual cross-attention mechanism that fuses global text conditions with localized layout information. The global text prompt is processed by a frozen CLIP text encoder to produce semantic embeddings, which serve as the global conditioning signal. Concurrently, layout information is encoded using a combination of Fourier encoding and category-specific embeddings. Each object's bounding box, whether axis-aligned or rotated, is represented as a list of eight coordinates, which are then Fourier encoded to convert positional data into a frequency-domain vector. This encoded positional representation is concatenated with the category embedding obtained from the CLIP encoder and passed through a linear layer to generate layout control tokens. These tokens are injected into the diffusion process via a dual cross-attention module, where they modulate the attention mechanism to guide the generation process. The output of the model is a weighted sum of the global and layout-conditioned attention outputs, allowing the model to balance both high-level semantic guidance and precise spatial layout.
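The unified box representation described above can be sketched in a few lines. The snippet below is an illustration under assumptions, not the paper's implementation: the corner ordering, the `(cx, cy, w, h, theta)` parameterization of rotated boxes, the number of frequencies, and the `2^k * pi` frequency schedule are all choices made here for the example. It shows how a horizontal box and a rotated box both reduce to eight coordinates, which are then Fourier encoded.

```python
import math


def hbb_to_corners(xmin, ymin, xmax, ymax):
    """Axis-aligned box -> eight coordinates (four corners, starting top-left)."""
    return [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax]


def obb_to_corners(cx, cy, w, h, theta):
    """Rotated box (center, size, angle in radians) -> eight coordinates."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    corners = []
    half_extents = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    for dx, dy in half_extents:
        # Rotate the corner offset, then translate to the box center.
        corners.append(cx + dx * cos_t - dy * sin_t)
        corners.append(cy + dx * sin_t + dy * cos_t)
    return corners


def fourier_encode(coords, num_freqs=4):
    """Map each coordinate x to sin/cos pairs at frequencies 2^k * pi, k < num_freqs."""
    features = []
    for x in coords:
        for k in range(num_freqs):
            angle = (2.0 ** k) * math.pi * x
            features.append(math.sin(angle))
            features.append(math.cos(angle))
    return features


# Coordinates are assumed to be normalized to [0, 1] by the image size.
box = hbb_to_corners(0.1, 0.2, 0.5, 0.6)
rotated = obb_to_corners(0.3, 0.4, 0.4, 0.4, 0.0)  # zero rotation: same box
encoding = fourier_encode(box)
```

With eight coordinates and four frequencies, the encoding has 8 × 4 × 2 = 64 values. In the described model this vector would then be concatenated with the category embedding and projected by a linear layer to form a layout token.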

As shown in the figure below, the layout embedding module combines bounding box coordinates with vectorized semantic information using Fourier and MLP layers, encoding the layout so that it can control generation, while the prompt description is processed by a CLIP text encoder for global conditional guidance. Layout information is injected at the noise level, where a local mask governs the injection position and allows finer layout control. In the overall architecture and training process of AeroGen, at each timestep the image being denoised first passes through a layout information injection module, which strengthens the layout conditional guidance. The architecture integrates residual blocks and self-attention layers, and layout control is applied through a layout mask attention mechanism that uses a binary mask to guide the attention computation, enabling precise manipulation of local noise characteristics during diffusion generation.
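The idea of a binary mask guiding the attention computation can be illustrated with a small pure-Python sketch. This is an assumption-laden toy, not the paper's Layout Mask Attention module: it treats each pixel as a query and each layout token as a key, and allows a pixel to attend to a token only if the pixel lies inside that token's box. The function names and the end-exclusive, axis-aligned box convention are hypothetical.

```python
import math


def box_mask(height, width, box):
    """Flattened binary mask for an axis-aligned box (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = box
    return [
        1 if (x0 <= x < x1 and y0 <= y < y1) else 0
        for y in range(height)
        for x in range(width)
    ]


def masked_attention(queries, keys, values, mask):
    """Scaled dot-product attention where mask[i][j] == 0 blocks query i from key j."""
    dim = len(keys[0])
    outputs = []
    for i, query in enumerate(queries):
        allowed = [j for j in range(len(keys)) if mask[i][j]]
        if not allowed:
            # Pixels outside every box receive no layout signal.
            outputs.append([0.0] * len(values[0]))
            continue
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
            for j in allowed
        ]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        out = [0.0] * len(values[0])
        for weight, j in zip(weights, allowed):
            for d, v in enumerate(values[j]):
                out[d] += (weight / total) * v
        outputs.append(out)
    return outputs


# A 2x2 image with two layout tokens: box A covers the left column, box B the top row.
mask_a = box_mask(2, 2, (0, 0, 1, 2))
mask_b = box_mask(2, 2, (0, 0, 2, 1))
mask = [[a, b] for a, b in zip(mask_a, mask_b)]  # one row per pixel
queries = [[0.0, 0.0]] * 4
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
attended = masked_attention(queries, keys, values, mask)
```

The top-left pixel sits in both boxes and mixes both tokens, each remaining pixel inside a box sees only that box's token, and the bottom-right pixel, which is outside both boxes, receives zeros.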

The second component of the framework is a generative pipeline that produces diverse and high-quality synthetic data by combining a diffusion-based generator with a data filtering mechanism. This pipeline operates in five stages: label generation, label filtering, image generation, image filtering, and data augmentation. In the label generation stage, a denoising diffusion probabilistic model (DDPM) is used to learn the conditional distribution of layout labels, which are represented as a matrix with dimensions $H \times W \times N$, where $H$ and $W$ are the image dimensions and $N$ is the number of object categories. Each element in the matrix is set to 1 if the pixel belongs to a target region of a specific category and -1 otherwise. The DDPM generator samples from this distribution to produce synthetic layout labels. These labels are then passed through a filtering mechanism based on Gaussian distributions, which ensures that the generated bounding box attributes, such as area, conform to realistic distributions by applying a threshold based on the standard deviation. This filtering step helps to exclude implausible or low-quality layout conditions.
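The label representation and the standard-deviation filter can be sketched as follows. This is a minimal illustration under assumptions: boxes are axis-aligned and end-exclusive, the Gaussian is fitted to box areas from real annotations, and the threshold multiplier `k` is a hypothetical parameter whose value is not given in the text.

```python
import statistics


def layout_to_label_matrix(height, width, num_classes, boxes):
    """Build an H x W x N nested list: +1 inside a box of class n, -1 elsewhere.

    boxes is a list of (class_id, x0, y0, x1, y1) with end-exclusive pixel bounds.
    """
    labels = [[[-1] * num_classes for _ in range(width)] for _ in range(height)]
    for class_id, x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                labels[y][x][class_id] = 1
    return labels


def within_k_sigma(value, reference_values, k=3.0):
    """Keep a value only if it lies within k standard deviations of the reference mean."""
    mean = statistics.fmean(reference_values)
    std = statistics.pstdev(reference_values)
    return abs(value - mean) <= k * std


# One 2x2 box of class 1 in a 4x4 image with two categories.
labels = layout_to_label_matrix(4, 4, 2, [(1, 1, 1, 3, 3)])

# Illustrative box areas for one category, taken as the "real" reference distribution.
real_areas = [90, 100, 110, 95, 105]
plausible = within_k_sigma(104, real_areas)    # close to the mean
implausible = within_k_sigma(400, real_areas)  # far outside the distribution
```

The same check could be applied to other box attributes, such as aspect ratio, one attribute and one category at a time.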

As shown in the figure below, the generative pipeline begins with label generation, where a denoising diffusion model samples synthetic labels. These labels are then filtered using a Gaussian distribution-based mechanism to ensure they are realistic. The filtered labels are augmented and used to guide the image generation process, where the layout-guided diffusion model produces synthetic images. The generated images undergo a quality assessment based on both semantic and layout consistency. Semantic consistency is evaluated using the CLIP model, while layout consistency is assessed using a ResNet101-based classifier. Images that meet predefined thresholds for both criteria are selected for the final dataset. The pipeline concludes with data augmentation, where the synthetic images are combined with real images to train downstream object detection models. This process ensures that the synthetic data is both diverse and semantically consistent, improving the performance of the downstream detectors.
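The two-threshold image filter can be sketched as below. This is a hedged illustration only: the scores are assumed to be precomputed (a CLIP image-text similarity for semantic consistency, and classifier predictions on box crops for layout consistency), and the `Candidate` type, the scoring rule, and the threshold values are hypothetical, since the text does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    image_id: str
    semantic_score: float  # assumed: precomputed CLIP image-text similarity
    layout_score: float    # assumed: fraction of box crops classified as expected


def layout_consistency(expected_labels, predicted_labels):
    """Fraction of box crops whose predicted class matches the layout's class."""
    if not expected_labels:
        return 0.0
    matches = sum(e == p for e, p in zip(expected_labels, predicted_labels))
    return matches / len(expected_labels)


def filter_candidates(candidates, semantic_threshold, layout_threshold):
    """Keep only images that pass both the semantic and the layout check."""
    return [
        c for c in candidates
        if c.semantic_score >= semantic_threshold and c.layout_score >= layout_threshold
    ]


# Illustrative values; a real pipeline would obtain these from CLIP and a classifier.
partial = layout_consistency(["ship", "ship", "airplane"], ["ship", "harbor", "airplane"])
candidates = [
    Candidate("a", semantic_score=0.31, layout_score=1.0),
    Candidate("b", semantic_score=0.12, layout_score=1.0),      # fails semantic check
    Candidate("c", semantic_score=0.33, layout_score=partial),  # fails layout check
]
kept = filter_candidates(candidates, semantic_threshold=0.25, layout_threshold=0.8)
```

Requiring both checks to pass means an image that matches the prompt globally is still rejected when its objects do not appear where the layout places them.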

Experiment

The evaluation assessed AeroGen’s generative capabilities and its effectiveness as a data augmentation tool for downstream remote sensing object detection tasks across multiple benchmark datasets. Comparative analyses demonstrate that the model consistently produces high-quality imagery with superior layout consistency and enhanced small object rendering, while successfully accommodating rotated bounding boxes. Furthermore, experiments confirm that integrating these synthetic images significantly boosts downstream detection performance, particularly for underrepresented categories, and outperforms traditional augmentation strategies. Ablation studies further validate that specific architectural components and pipeline filtering mechanisms work synergistically to optimize generation quality, establishing AeroGen as a robust solution for enhancing remote sensing vision tasks.

The authors compare data augmentation strategies on a downstream object detection task, covering traditional methods and the proposed AeroGen-based approach. Traditional augmentation alone performs worse than augmentation with AeroGen synthetic data, and combining the two outperforms either strategy used individually. The best results on both mAP and mAP50 come from combining AeroGen with Flip and CopyPaste.

The authors evaluate how synthetic data generated by AeroGen affects downstream object detection. Adding synthetic data consistently improves performance across datasets and metrics, and the improvement grows as the amount of synthetic data increases. The gains are largest for rare categories.

The authors evaluate AeroGen's generation of images from layout conditions on multiple datasets and modalities, comparing it with state-of-the-art layout-to-image methods. AeroGen achieves the best results across all reported metrics, including on data with rotated bounding boxes. Data augmentation experiments further show that its synthetic data consistently improves downstream detection performance, especially for rare categories. The Layout Mask Attention and Dual Cross Attention modules improve image quality and layout consistency in the generated outputs.

The authors evaluate the impact of synthetic data generated by AeroGen on downstream object detection using the DIOR-R dataset. Detection performance improves on both mAP and mAP50 as the amount of synthetic data increases, with the best results at the largest data scale. Synthetic data augmentation consistently outperforms the baseline trained without augmentation.

The authors evaluate AeroGen's generative capabilities, focusing on layout-to-image generation and its effectiveness for data augmentation in downstream object detection. AeroGen outperforms state-of-the-art methods at generating images consistent with the input layouts across multiple datasets, and its synthetic data significantly improves detection performance, particularly for rare categories. Ablation studies confirm that both the Layout Mask Attention and Dual Cross Attention modules enhance image quality and detection performance.

The evaluation setup tests AeroGen’s layout-to-image generation capabilities and its effectiveness as a data augmentation strategy for downstream object detection across multiple datasets. Comparative benchmarks validate that the model produces higher quality, layout-consistent images than existing methods, particularly for rotated bounding boxes. Augmentation experiments confirm that integrating synthetic data with traditional techniques yields the strongest performance gains, which scale positively with data volume and are especially beneficial for rare object categories. Finally, ablation studies validate the critical role of dedicated attention modules in preserving spatial accuracy and boosting overall detection reliability.

