Controlling Vision-Language Models for Multi-Task Image Restoration

Vision-language models such as CLIP have shown great impact on diverse downstream tasks for zero-shot or label-free predictions. However, when it comes to low-level vision such as image restoration, their performance deteriorates dramatically due to corrupted inputs. In this paper, we present a degradation-aware vision-language model (DA-CLIP) that better transfers pretrained vision-language models to low-level vision tasks as a multi-task framework for image restoration. More specifically, DA-CLIP trains an additional controller that adapts the fixed CLIP image encoder to predict high-quality feature embeddings. By integrating these embeddings into an image restoration network via cross-attention, we are able to guide the model toward high-fidelity image reconstruction. The controller also outputs a degradation feature that matches the actual corruption of the input, yielding a natural classifier for different degradation types. In addition, we construct a mixed-degradation dataset with synthetic captions for DA-CLIP training. Our approach advances state-of-the-art performance on both \emph{degradation-specific} and \emph{unified} image restoration tasks, showing a promising direction for prompting image restoration with large-scale pretrained vision-language models. Our code is available at https://github.com/Algolzw/daclip-uir.
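
To make the architecture described above concrete, the following is a minimal PyTorch sketch of the core idea, not the authors' implementation (see the linked repository for that): a frozen encoder stands in for the fixed CLIP image encoder, a trainable controller predicts a content embedding plus a degradation embedding, and a cross-attention block injects the content embedding into the restoration network's features. All module names, layer sizes, and the toy encoder backbone are illustrative assumptions.

```python
# Minimal sketch of the controller + cross-attention idea. Sizes and modules
# are illustrative assumptions, not the DA-CLIP implementation.
import torch
import torch.nn as nn

EMB = 512  # embedding width (chosen for illustration)

class FrozenImageEncoder(nn.Module):
    """Stand-in for the fixed (frozen) CLIP image encoder."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, EMB),
        )
        for p in self.parameters():
            p.requires_grad = False  # encoder stays fixed; only the controller trains

    def forward(self, x):
        return self.backbone(x)

class Controller(nn.Module):
    """Trainable adapter: maps the corrupted-image feature to a high-quality
    content embedding and a degradation embedding."""
    def __init__(self):
        super().__init__()
        self.content_head = nn.Sequential(nn.Linear(EMB, EMB), nn.GELU(), nn.Linear(EMB, EMB))
        self.degrade_head = nn.Sequential(nn.Linear(EMB, EMB), nn.GELU(), nn.Linear(EMB, EMB))

    def forward(self, feat):
        return self.content_head(feat), self.degrade_head(feat)

class CrossAttnBlock(nn.Module):
    """Injects the content embedding into restoration features via cross-attention."""
    def __init__(self, dim=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.to_kv = nn.Linear(EMB, dim)

    def forward(self, feat_map, content_emb):
        b, c, h, w = feat_map.shape
        q = feat_map.flatten(2).transpose(1, 2)       # (B, HW, C): image features as queries
        kv = self.to_kv(content_emb).unsqueeze(1)     # (B, 1, C): embedding as key/value
        out, _ = self.attn(q, kv, kv)
        return feat_map + out.transpose(1, 2).reshape(b, c, h, w)

if __name__ == "__main__":
    lq = torch.randn(2, 3, 128, 128)                  # corrupted (low-quality) input
    feat = FrozenImageEncoder()(lq)
    content, degrade = Controller()(feat)
    restored_feat = CrossAttnBlock()(torch.randn(2, 64, 32, 32), content)
    print(content.shape, degrade.shape, restored_feat.shape)
```

In this reading, the degradation embedding can be compared against text embeddings of degradation names (e.g. "rainy", "hazy") to act as the degradation classifier mentioned in the abstract, while the content embedding conditions the restoration network through cross-attention.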