One month ago

Meng Yu, Lei Sun, Jianhao Zeng, Xiangxiang Chu, Kun Zhan

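Abstract

Diffusion Probabilistic Models have demonstrated remarkable performance across a wide range of generative tasks. However, we find that these models commonly suffer from a Signal-to-Noise Ratio-timestep (SNR-t) bias. This bias refers to the mismatch, at inference time, between the SNR of a denoising sample and its corresponding timestep. Specifically, during training the SNR of a sample is strictly coupled with its timestep. During inference this correspondence breaks down, which leads to error accumulation and degrades generation quality. We provide comprehensive empirical evidence and theoretical analysis to substantiate this phenomenon, and we propose a simple yet effective differential correction method to mitigate the SNR-t bias. Observing that diffusion models typically reconstruct low-frequency components before focusing on high-frequency details during reverse denoising, we decompose samples into different frequency components and apply differential correction to each component individually. Extensive experiments show that the proposed approach significantly improves the generation quality of several diffusion models (IDDPM, ADM, DDIM, A-DPM, EA-DPM, EDM, PFGM++, and FLUX) on datasets of various resolutions while adding almost no computational overhead. The code is available at https://github.com/AMAP-ML/DCW.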

One-sentence Summary

To mitigate the Signal-to-Noise Ratio-timestep (SNR-t) bias caused by the misalignment between sample SNR and timesteps during inference, the authors propose a differential correction method that decomposes samples into frequency components to enhance the generation quality of various diffusion models, including IDDPM, ADM, DDIM, A-DPM, EA-DPM, EDM, PFGM++, and FLUX, with negligible computational overhead.

Key Contributions

The paper identifies the Signal-to-Noise Ratio-timestep (SNR-t) bias in Diffusion Probabilistic Models and provides a theoretical analysis and empirical evidence to explain how this mismatch between sample SNR and timesteps during inference leads to error accumulation.
A dynamic differential correction method in the wavelet domain is introduced to alleviate this bias by decomposing samples into frequency components and applying targeted corrections based on the model's tendency to reconstruct low-frequency contours before high-frequency details.
Extensive experiments demonstrate that this training-free and plug-and-play approach significantly improves the generation quality of various models, including IDDPM, EDM, and FLUX, across multiple datasets with negligible computational overhead.

Introduction

Diffusion Probabilistic Models (DPMs) have become essential for high-quality generative tasks such as image, audio, and video synthesis. However, these models often suffer from Signal-to-Noise Ratio-timestep (SNR-t) bias, where the signal-to-noise ratio of a sample during inference deviates from the ratio strictly coupled with its timestep during training. This misalignment, caused by cumulative prediction and discretization errors, leads to inaccurate noise predictions and degraded generation quality. The authors identify this fundamental bias through theoretical and empirical analysis and propose a training-free, plug-and-play differential correction method. By applying this correction within the wavelet domain, the authors allow the model to address different frequency components separately, effectively aligning the predicted sample distribution with the ideal perturbed distribution.

Method

The authors build on the diffusion probabilistic model (DPM) framework, which consists of a forward diffusion process and a reverse denoising process, both modeled as Markov chains. The forward process gradually adds Gaussian noise to a data sample $\boldsymbol{x}_0$ according to a variance schedule $\beta_t$, resulting in a sequence of noisy samples $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_T$. This process is defined as $q(\boldsymbol{x}_{1:T}|\boldsymbol{x}_0) = \prod_{t=1}^T q(\boldsymbol{x}_t|\boldsymbol{x}_{t-1})$, where $q(\boldsymbol{x}_t|\boldsymbol{x}_{t-1}) = \mathcal{N}(\boldsymbol{x}_t; \sqrt{1-\beta_t}\boldsymbol{x}_{t-1}, \beta_t\boldsymbol{I})$. The reverse process, which is the core of generation, inverts this noise addition step by step to recover the original data. A model $p_{\theta}(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)$ is trained to approximate the reverse transition distribution $q(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0)$; in practice it is parameterized by a noise prediction network $\boldsymbol{\epsilon}_{\theta}(\cdot)$ that predicts the noise $\boldsymbol{\epsilon}_t$ present at timestep $t$. The network is trained to minimize the L2 distance between its predicted noise and the actual noise $\boldsymbol{\epsilon}_t$, leading to the simple objective $\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, \boldsymbol{x}_0, \boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left[ \left\| \boldsymbol{\epsilon}_{\theta}(\boldsymbol{x}_t, t) - \boldsymbol{\epsilon}_t \right\|_2^2 \right]$. Once trained, the model generates samples by starting from standard Gaussian noise $\boldsymbol{x}_T$ and iteratively denoising with the learned transition $p_{\theta}(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)$.
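The coupling between timestep and SNR imposed by the forward process can be made concrete with a short sketch. Composing the Gaussian transitions above gives the standard closed form $q(\boldsymbol{x}_t|\boldsymbol{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\boldsymbol{x}_0, (1-\bar{\alpha}_t)\boldsymbol{I})$ with $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$, so the SNR at timestep $t$ is $\bar{\alpha}_t / (1-\bar{\alpha}_t)$. The linear schedule, its endpoints, $T = 1000$, and all function names below are assumptions chosen for illustration; they are not taken from the paper or its code.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Linear variance schedule (an assumed, commonly used choice)."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bar = np.cumprod(1.0 - betas)  # alpha_bar[t-1] holds \bar{alpha}_t
    return betas, alpha_bar

def snr(alpha_bar):
    """SNR implied by the forward process: signal power over noise power."""
    return alpha_bar / (1.0 - alpha_bar)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form; t is 1-indexed."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

def simple_loss(eps_pred, eps):
    """Squared L2 error between predicted and true noise, averaged over the batch."""
    per_sample = np.sum((eps_pred - eps) ** 2, axis=tuple(range(1, eps.ndim)))
    return float(np.mean(per_sample))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    betas, alpha_bar = make_schedule()
    x0 = rng.standard_normal((4, 3, 8, 8))  # toy batch standing in for images
    x_t, eps = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)
    print("SNR at t=1, 500, 1000:", snr(alpha_bar)[[0, 499, 999]])
```

During training every $\boldsymbol{x}_t$ is produced by `q_sample`, so the network only ever sees inputs whose SNR equals `snr(alpha_bar)[t - 1]`. At inference the input is the model's own previous output, and nothing enforces that relationship.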

Figure: The forward and reverse processes of a DPM

The paper identifies a bias in this process, termed SNR-t bias, where the Signal-to-Noise Ratio (SNR) of the predicted sample $\hat{\boldsymbol{x}}_t$ during inference does not match the SNR expected at its corresponding timestep $t$. To address this, the authors propose a differential correction method. The core idea is that the difference between the predicted sample $\hat{\boldsymbol{x}}_{t-1}$ and the reconstructed sample $\boldsymbol{x}_{\theta}^0(\hat{\boldsymbol{x}}_t, t)$ (the clean-sample estimate computed from the output of the noise prediction network) contains a directional signal that can guide the prediction toward a more accurate, lower-bias state. This differential signal is defined as $\hat{\boldsymbol{x}}_{t-1} - \boldsymbol{x}_{\theta}^0(\hat{\boldsymbol{x}}_t, t)$, and the correction is integrated into the denoising step as the update $\hat{\boldsymbol{x}}_{t-1} \leftarrow \hat{\boldsymbol{x}}_{t-1} + \lambda_t (\hat{\boldsymbol{x}}_{t-1} - \boldsymbol{x}_{\theta}^0(\hat{\boldsymbol{x}}_t, t))$, where $\lambda_t$ is a scalar guidance factor. The correction is applied after the initial denoising step and improves sample quality with negligible additional computational cost.
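As a minimal sketch of the pixel-space update described above, the correction is a single extrapolation step away from the reconstructed sample. The reconstruction formula used here is the standard one for a noise-prediction parameterization, $\boldsymbol{x}_{\theta}^0 = (\hat{\boldsymbol{x}}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_{\theta}) / \sqrt{\bar{\alpha}_t}$, and is an assumption about how the reconstructed sample is obtained. The function names and the example value of `lam` are hypothetical.

```python
import numpy as np

def reconstruct_x0(x_t, eps_pred, alpha_bar_t):
    """Estimate the clean sample from x_t and the predicted noise
    (standard epsilon-parameterization; assumed here)."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

def differential_correction(x_prev, x0_pred, lam):
    """Apply x_prev <- x_prev + lam * (x_prev - x0_pred).

    x_prev  : sample produced by the ordinary denoising step (x_hat_{t-1})
    x0_pred : reconstructed sample x_theta^0(x_hat_t, t)
    lam     : scalar guidance factor lambda_t (value is a tunable assumption)
    """
    return x_prev + lam * (x_prev - x0_pred)

if __name__ == "__main__":
    x_prev = np.array([1.0, 2.0])
    x0_pred = np.array([0.5, 2.5])
    print(differential_correction(x_prev, x0_pred, lam=0.1))
```

Setting `lam` to zero recovers the uncorrected sampler, which makes the correction easy to toggle when comparing against a baseline.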

Figure: The overall framework of Differential Correction in Wavelet domain (DCW)

To further improve the correction, the authors introduce the Differential Correction in Wavelet Domain (DCW) framework, as shown in the figure above. This approach decomposes the predicted sample $\hat{\boldsymbol{x}}_{t-1}$ and the reconstructed sample $\boldsymbol{x}_{\theta}^0(\hat{\boldsymbol{x}}_t, t)$ into four frequency subbands (LL, LH, HL, HH) using the Discrete Wavelet Transform (DWT). The differential correction is then applied separately to each subband, with the correction term $\lambda_t^f (\hat{\boldsymbol{x}}_{t-1}^f - \boldsymbol{x}_{\theta}^0(\hat{\boldsymbol{x}}_t, t)^f)$, where $f$ denotes the frequency component. This allows for targeted correction: the low-frequency (LL) component, which captures the global structure, is prioritized early in the process, while the high-frequency components, which contain fine details, are emphasized later. The correction coefficients $\lambda_t^f$ are adjusted dynamically according to the denoising progress, typically using the reverse process variance $\sigma_t$ as a guide. This wavelet-domain correction helps reduce the noise interference present in the differential signal, leading to more effective and focused refinement of the generated samples.
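The per-subband correction can be sketched with a single-level Haar transform written directly in NumPy. This is an illustration under stated assumptions, not the authors' implementation: the choice of the Haar wavelet, the single decomposition level, the sharing of one coefficient across the three high-frequency subbands, and the names `haar_dwt2`, `haar_idwt2`, and `dcw_correct` are all hypothetical. The time-dependent schedule for $\lambda_t^f$ is left to the caller because the text above does not specify it.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level orthonormal 2D Haar DWT over the last two axes.
    Height and width must be even. Returns (LL, LH, HL, HH);
    the LH/HL naming convention varies between libraries."""
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2.0
    lh = (a - b + c - d) / 2.0
    hl = (a + b - c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2."""
    h, w = ll.shape[-2:]
    out = np.empty(ll.shape[:-2] + (2 * h, 2 * w), dtype=np.result_type(ll, lh, hl, hh))
    out[..., 0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    out[..., 0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    out[..., 1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    out[..., 1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return out

def dcw_correct(x_prev, x0_pred, lam_low, lam_high):
    """Differential correction applied per wavelet subband.

    lam_low scales the LL correction; lam_high scales LH, HL and HH.
    Both stand in for the time-dependent coefficients lambda_t^f.
    """
    bands_prev = haar_dwt2(x_prev)
    bands_x0 = haar_dwt2(x0_pred)
    lams = (lam_low, lam_high, lam_high, lam_high)
    corrected = [p + lam * (p - r) for p, r, lam in zip(bands_prev, bands_x0, lams)]
    return haar_idwt2(*corrected)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_prev = rng.standard_normal((3, 8, 8))
    x0_pred = rng.standard_normal((3, 8, 8))
    out = dcw_correct(x_prev, x0_pred, lam_low=0.05, lam_high=0.02)
    print(out.shape)
```

Because the transform is linear, choosing `lam_low == lam_high` reduces this to the pixel-space correction. Any benefit from the wavelet domain therefore comes from weighting the subbands differently over the course of sampling.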

Experiment

Experiments investigate the SNR-t bias by analyzing how mismatched signal-to-noise ratios and timesteps affect network predictions, specifically finding that reverse denoising samples often exhibit lower SNR than intended. To address this, the proposed DCW method is evaluated across various diffusion frameworks and datasets to validate its ability to correct these errors. The results demonstrate that DCW consistently improves generation quality and aesthetic appeal with negligible computational overhead, even when integrated into existing bias-corrected models.

This experiment evaluates how the domain in which differential correction is applied affects generation quality. Applying the correction to both the high- and low-frequency wavelet components achieves the lowest FID scores, outperforming pixel-space correction and correction of a single frequency band. High-frequency correction alone performs better than low-frequency correction alone, and pixel-space correction falls between the single-frequency and combined approaches.

The table presents FID scores for several diffusion models on the CIFAR-10 dataset under different numbers of sampling steps. Integrating the proposed method consistently reduces FID for every baseline model, indicating improved generation quality. The improvement holds in both low-step and high-step sampling scenarios, which indicates that the method is robust to the sampling step count.
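For readers unfamiliar with the metric, FID is the Fréchet distance between two Gaussians fitted to feature vectors of real and generated images: $\|\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\|_2^2 + \mathrm{Tr}(\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2 - 2(\boldsymbol{\Sigma}_1\boldsymbol{\Sigma}_2)^{1/2})$, where lower is better. The sketch below computes only this distance from arbitrary feature arrays. In practice the features come from a pretrained Inception network, which is omitted here. The function names are hypothetical, and this is not the evaluation code used in the paper.

```python
import numpy as np

def feature_stats(feats):
    """Mean and covariance of an (n_samples, n_features) array."""
    return feats.mean(axis=0), np.cov(feats, rowvar=False)

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2).

    Tr((sigma1 sigma2)^{1/2}) is computed from the eigenvalues of the product,
    which are real and non-negative for positive semi-definite inputs.
    """
    diff = mu1 - mu2
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    # Clip tiny negative values caused by floating-point error.
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.standard_normal((500, 16))        # stand-in for real-image features
    fake = rng.standard_normal((500, 16)) + 0.1  # stand-in for generated features
    print(frechet_distance(*feature_stats(real), *feature_stats(fake)))
```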

The table compares the signal-to-noise ratio (SNR) of forward and reverse samples in diffusion models. At the same timestep, reverse samples exhibit a lower SNR than forward samples, so the actual SNR of a denoised sample does not match the SNR expected at that timestep. The SNR of reverse samples is also influenced by the network's tendency to overestimate its predictions when processing lower-SNR inputs. This discrepancy leads to prediction errors that accumulate along the denoising trajectory.

SNR comparison between forward and reverse samples

This experiment evaluates the computational overhead of the DCW method across different models and datasets. DCW adds only a minimal time cost to the baseline models: the overhead percentage remains consistently low in all tested configurations and does not significantly affect generation speed.

The authors evaluate their method across multiple diffusion models and datasets. The proposed approach consistently reduces FID and improves Recall, indicating better fidelity and diversity. These gains hold for both classic and bias-corrected diffusion models and across varying sampling steps and model architectures.

These experiments evaluate the effectiveness of differential correction across various domains, sampling steps, and diffusion architectures to enhance generation quality. The results demonstrate that applying correction to both high and low frequency wavelet components yields superior performance and that the method consistently improves fidelity and diversity across different datasets. Furthermore, the analysis confirms that the proposed approach addresses SNR discrepancies in the denoising process while introducing negligible computational overhead.

Elucidating the SNR-t Bias of Diffusion Probabilistic Models
