RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models

Abstract

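Image restoration under real-world degradations is critical for downstream tasks such as autonomous driving and object detection. However, existing restoration models are largely constrained by the scale and distribution of their training data, so they often generalize poorly to real-world scenarios. Recently, large-scale image editing models have shown strong generalization in restoration tasks; in particular, closed-source models such as Nano Banana Pro achieve high performance while preserving consistency during restoration. Nevertheless, reaching comparable performance with such large general-purpose models requires massive amounts of data and computation. To address this, we construct a large-scale dataset covering nine common real-world degradation types and train a state-of-the-art open-source model to narrow the gap with closed-source alternatives. We also introduce RealIR-Bench, which contains 464 real-world degraded images together with tailored evaluation metrics focused on degradation removal and consistency preservation. Extensive experiments show that the proposed model ranks first among open-source methods and achieves state-of-the-art (SOTA) performance.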

One-sentence Summary

Researchers from StepFun and Southern University of Science and Technology propose RealRestorer, an open-source model trained on a new large-scale dataset to restore diverse real-world image degradations. The approach narrows the performance gap with closed-source alternatives and introduces RealIR-Bench for rigorous evaluation, with downstream applications such as autonomous driving and object detection in mind.

Key Contributions

  • The paper introduces RealRestorer, an open-source image restoration model fine-tuned from a large image editing architecture to handle nine common real-world degradation types while achieving state-of-the-art performance comparable to closed-source systems.
  • A comprehensive data generation pipeline is developed to synthesize high-quality training data with diverse and representative degradations, effectively narrowing the gap between synthetic distributions and real-world conditions.
  • RealIR-Bench is presented as a new benchmark containing 464 real-world degraded images and tailored evaluation metrics to assess both degradation removal and consistency preservation in authentic scenarios.

Introduction

Real-world image restoration is essential for critical downstream applications like autonomous driving and object detection, yet existing models struggle to generalize because they rely on limited synthetic training data that fails to capture the complexity of real-world degradations. While large-scale closed-source image editing models demonstrate superior performance, their high computational costs and lack of transparency hinder reproducibility and broader research adoption. To address these challenges, the authors leverage a comprehensive data synthesis pipeline to train RealRestorer, an open-source model that fine-tunes large image editing architectures to achieve state-of-the-art results across nine degradation types. They further introduce RealIR-Bench, a new benchmark featuring authentic degraded images and tailored metrics to better evaluate restoration quality and content consistency without relying on clean references.

Dataset

  • Dataset Composition and Sources: The authors construct a comprehensive dataset for nine image restoration tasks by combining two primary sources: Synthetic Degradation Data and Real-World Degradation Data. The synthetic component leverages clean images collected from the internet, while the real-world component sources naturally degraded images from web platforms and high-quality open-source sites like Pexels and Pinterest.

  • Key Details for Each Subset

    • Synthetic Degradation Data: This subset generates paired data by applying specific degradation models to clean images. The authors utilize open-source models like SAM-2 and MiDaS to extract semantic masks and depth cues for realistic synthesis.
      • Blur: Synthesized via temporal averaging of video clips and web-style operations like Gaussian blur.
      • Compression Artifacts: Simulated using JPEG compression and resizing to mimic web effects.
      • Moiré Patterns: Created by fusing 3,000 generated patterns at multiple scales into clean images.
      • Low-Light: Achieved through brightness attenuation, gamma correction, and a specialized model trained on LOL and LSRW datasets.
      • Noise: Uses web-style degradation with added granular and segment-aware noise.
      • Flare: Involves blending over 3,000 collected glare patterns with random flipping.
      • Reflection: Combines portrait images as transmission layers with diverse scenes as reflection layers, following the SynNet pipeline.
      • Haze: Generated using the atmospheric scattering model enhanced with nearly 200 collected haze patterns.
      • Rain: Incorporates physical effects like splashes and perspective distortion alongside 200 real rain patterns and 70K samples from the FoundIR dataset.
    • Real-World Degradation Data: This subset pairs real degraded images with clean references generated by high-performance restoration models. It covers six degradation types (blur, rain, low light, haze, reflection, and flare) that exhibit significant gaps compared to synthetic patterns.
  • Data Usage and Processing: The authors employ a rigorous filtering pipeline to ensure data quality and alignment.

    • Filtering: Vision-Language Models (VLMs) and quality assessment models remove watermarked or low-quality images. CLIP filters real-world data based on semantic cues, while Qwen3-VL-8B-Instruct verifies degradation severity.
    • Alignment Checks: The team uses low-level metrics and skeleton-shift-based methods to detect content shifts and alignment errors between degraded and clean pairs.
    • Human Curation: A subset of filtered pairs undergoes manual review by three experts to confirm degradation type and severity alignment.
    • Training Mixture: The final training set combines both synthetic and real-world pairs, with specific statistics provided per degradation type to balance the dataset.
  • Benchmark and Evaluation: The authors introduce RealIR-Bench, a test set containing 464 non-reference degraded images sourced directly from the internet. This benchmark covers all nine restoration tasks and includes complex mixed degradations. Evaluation uses a fixed enhancement instruction to minimize instruction variation, focusing on restoration capability and scene consistency. Quality is assessed using metrics like LPIPS, RS, and FS, alongside human-rated scores for enhancement capability and overall visual quality.
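
Two of the synthetic degradations listed above follow well-known closed-form models, so they can be sketched in a few lines. The example below is a minimal illustration, not the authors' pipeline: it applies the standard atmospheric scattering model I = J·t + A·(1 − t) with t = exp(−β·d) for haze, and brightness attenuation followed by gamma correction for low light. The function names, the parameter values (`beta`, `airlight`, `scale`, `gamma`), and the use of nested lists in place of real images and MiDaS depth maps are all assumptions made for illustration.

```python
import math


def add_haze(clean, depth, beta=1.0, airlight=0.9):
    """Apply the atmospheric scattering model I = J * t + A * (1 - t).

    clean: 2D list of pixel intensities in [0, 1] (the clean image J).
    depth: 2D list of non-negative depths d, same shape as `clean`
           (a stand-in for a depth map from an estimator such as MiDaS).
    beta, airlight: hypothetical scattering coefficient and airlight A.
    """
    hazy = []
    for row_j, row_d in zip(clean, depth):
        out_row = []
        for j, d in zip(row_j, row_d):
            t = math.exp(-beta * d)  # transmission falls off with depth
            out_row.append(j * t + airlight * (1.0 - t))
        hazy.append(out_row)
    return hazy


def darken(clean, scale=0.4, gamma=2.0):
    """Brightness attenuation followed by gamma correction.

    For intensities in [0, 1], gamma > 1 pushes values further toward black.
    `scale` and `gamma` are illustrative values, not taken from the paper.
    """
    return [[(scale * j) ** gamma for j in row] for row in clean]


if __name__ == "__main__":
    image = [[0.2, 0.5], [0.8, 1.0]]
    depth = [[0.0, 0.5], [1.0, 3.0]]
    print(add_haze(image, depth))
    print(darken(image))
```

At zero depth the hazy pixel equals the clean pixel, and at large depth it approaches the airlight value, which is the behavior the scattering model is meant to capture. The paper additionally blends collected haze patterns and uses a learned model for low light; neither is reproduced here.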

Method

The proposed method is built upon the Step1X-Edit base model, which uses a Diffusion Transformer (DiT) backbone that is effective for generation tasks. The architecture incorporates QwenVL as a text encoder to inject high-level semantic features into the DiT denoising pathway. Within the diffusion network, a dual-stream design jointly processes semantic information along with noise and the conditional input image. Both the reference image and the output image are encoded into latent space using Flux-VAE. During training, the Flux-VAE and text encoder are frozen, and only the DiT component is fine-tuned.

The training strategy is divided into two distinct stages to optimize restoration performance. The first stage is a Transfer-training phase designed to transfer high-level knowledge and priors from image editing to image restoration using synthetic paired data. This stage operates at a high resolution of 1024 × 1024 with a constant learning rate of 1e-5 and a global batch size of 16. To ensure broad generalization, a single fixed prompt is adopted for each of the nine degradation tasks, and an average sampling ratio is used for multi-task learning.
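
The first-stage setup can be pictured as a small configuration plus a task sampler. The sketch below is hypothetical: it reads the "average sampling ratio" as drawing each of the nine tasks with equal probability, and the prompt strings, the `TASKS` identifiers, and the `sample_batch` helper are invented for illustration. Only the resolution, learning rate, and batch size come from the text.

```python
import random

# The nine degradation tasks named in the text (identifiers are illustrative).
TASKS = [
    "blur", "compression", "moire", "low_light", "noise",
    "flare", "reflection", "haze", "rain",
]

# Hypothetical prompts: one fixed string per task. The actual wording used
# by the authors is not given in the text.
PROMPTS = {
    task: f"Remove the {task.replace('_', ' ')} degradation from this image."
    for task in TASKS
}

# Stage-one hyperparameters as stated in the text.
STAGE1_CONFIG = {
    "resolution": (1024, 1024),
    "learning_rate": 1e-5,   # constant, no schedule
    "global_batch_size": 16,
}


def sample_batch(rng, batch_size=STAGE1_CONFIG["global_batch_size"]):
    """Draw (task, prompt) pairs with equal probability per task."""
    batch = []
    for _ in range(batch_size):
        task = rng.choice(TASKS)
        batch.append((task, PROMPTS[task]))
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    for task, prompt in sample_batch(rng):
        print(task, "->", prompt)
```

Keeping one fixed prompt per task removes prompt wording as a source of variation, so the model is pushed to associate each instruction with a degradation type rather than with phrasing.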

The second stage involves Supervised Fine-tuning to enhance restoration fidelity and generalization under real-world degradation scenarios. This stage emphasizes adaptation to complex and authentic degradation patterns using a cosine annealing learning rate schedule. A Progressively-Mixed training strategy is adopted, which retains a small proportion of synthetic paired samples alongside real-world data to prevent overfitting and preserve cross-task robustness. Additionally, a web-style degradation data augmentation strategy is introduced to improve robustness against images collected from the web, which often suffer from low visual quality and compression artifacts.
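
The two scheduling ideas in this stage can be sketched as plain functions of the training step. The cosine annealing formula below is the standard one; the peak learning rate is borrowed from stage one purely as a placeholder, since the text does not state the stage-two value. The `synthetic_fraction` function is a hypothetical linear decay standing in for the Progressively-Mixed strategy, whose actual mixing ratios are not given in the text.

```python
import math


def cosine_lr(step, total_steps, lr_max=1e-5, lr_min=0.0):
    """Standard cosine annealing from lr_max down to lr_min.

    lr_max and lr_min are assumed values, not taken from the paper.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def synthetic_fraction(step, total_steps, start=0.5, end=0.1):
    """Hypothetical share of synthetic pairs in each batch.

    Decays linearly from `start` to `end`, so a small proportion of
    synthetic data is always retained alongside real-world pairs.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


if __name__ == "__main__":
    total = 1000
    for step in (0, 250, 500, 750, 1000):
        print(step, cosine_lr(step, total), synthetic_fraction(step, total))
```

Retaining a nonzero synthetic share at the end of training is the part that matches the text's stated goal of preventing overfitting to real-world pairs; the specific start and end values here are arbitrary.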

The pipeline addresses nine specific degradation types: blur, compression artifacts, moiré patterns, low-light, noise, flare, reflection, haze, and rain. As shown in the paper's data-generation figure, the process for these diverse degradations involves specific processing steps such as VLM filtering, Retinexformer for low-light adjustment, and Real-ESRGAN for noise simulation, ultimately producing the degraded images used for training.

Experiment

  • RealIR-Bench evaluation validates that RealRestorer effectively removes diverse real-world degradations while preserving content fidelity, ranking first among open-source models and closely trailing top closed-source systems across nine tasks including deblurring, low-light enhancement, and reflection removal.
  • FoundIR benchmark testing confirms the model achieves superior performance on isolated degradation tasks compared to other image editing models, demonstrating a strong balance between restoration quality and perceptual consistency despite the inherent limitations of generative approaches on reference-based metrics.
  • Zero-shot generalization experiments show the model successfully handles unseen restoration scenarios like snow removal and old photo restoration by leveraging learned priors without specific fine-tuning.
  • Ablation studies establish that a two-stage training strategy combining synthetic and real-world data is essential, as it prevents overfitting and artifacts while ensuring robust generalization and structural consistency.
  • User studies and metric correlation analysis verify that the proposed non-reference evaluation framework aligns well with human judgment, confirming the model's ability to produce visually stable and coherent results.
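
One common way to check whether an automatic metric agrees with human judgment is a rank correlation between metric scores and human ratings. The sketch below computes Spearman's rank correlation in plain Python. It is an assumption that a rank correlation of this kind resembles the paper's analysis, since the text does not name the statistic used, and the scores in the usage example are made-up toy values, not results from the paper.

```python
def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    metric_scores = [0.61, 0.72, 0.55, 0.90, 0.80]  # toy values
    human_ratings = [3.0, 3.5, 2.5, 4.8, 4.1]       # toy values
    print(spearman(metric_scores, human_ratings))
```

A value near 1 would indicate that the metric orders images the same way human raters do, which is the kind of agreement the user study is described as confirming.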
