Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think

Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. In this paper, we show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed. The fixed model performs comparably to the best previously reported configuration while being more than 200× faster. To optimize for downstream task performance, we perform end-to-end fine-tuning on top of the single-step model with task-specific losses and obtain a deterministic model that outperforms all other diffusion-based depth and normal estimation models on common zero-shot benchmarks. Surprisingly, we find that this fine-tuning protocol also works directly on Stable Diffusion and achieves performance comparable to current state-of-the-art diffusion-based depth and normal estimation models, calling into question some of the conclusions drawn from prior works.
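
To make the single-step, image-conditional setup the abstract refers to more concrete, the sketch below shows one plausible shape of such a pipeline: encode the RGB image into a latent, concatenate it with a noise latent, run the denoiser once at the final timestep, and decode the result into a depth map. This is a minimal illustration under stated assumptions, not the authors' released code; TinyEncoder, TinyDenoiser, TinyDecoder, and predict_depth are hypothetical stand-ins for the actual VAE, UNet, and inference routine.

```python
# Minimal sketch (assumed structure, not the paper's implementation):
# single-step, image-conditional latent-diffusion depth prediction.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for a frozen VAE encoder mapping RGB to a latent."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=8, stride=8)
    def forward(self, x):
        return self.conv(x)

class TinyDenoiser(nn.Module):
    """Stand-in for the diffusion UNet, conditioned on the RGB latent."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(8, 4, kernel_size=3, padding=1)
    def forward(self, z, t):
        return self.conv(z)  # the toy model ignores the timestep

class TinyDecoder(nn.Module):
    """Stand-in for a VAE decoder mapping the latent to a depth map."""
    def __init__(self):
        super().__init__()
        self.up = nn.ConvTranspose2d(4, 1, kernel_size=8, stride=8)
    def forward(self, z):
        return self.up(z)

@torch.no_grad()
def predict_depth(image, encoder, denoiser, decoder):
    """One denoising step: concatenate the RGB latent with a noise latent,
    call the denoiser once at the last timestep, decode to a depth map."""
    rgb_latent = encoder(image)
    noise = torch.randn_like(rgb_latent)
    t = torch.full((image.shape[0],), 999, device=image.device)
    depth_latent = denoiser(torch.cat([rgb_latent, noise], dim=1), t)
    return decoder(depth_latent)

image = torch.rand(1, 3, 256, 256)
depth = predict_depth(image, TinyEncoder(), TinyDenoiser(), TinyDecoder())
print(depth.shape)  # torch.Size([1, 1, 256, 256])
```

Because only a single denoiser call is made and the result is decoded directly, a pipeline of this shape is deterministic up to the noise latent and avoids the cost of multi-step sampling, which is the efficiency point the abstract makes.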