Adding Conditional Control to Text-to-Image Diffusion Models

We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, e.g., edges, depth, segmentation, human pose, etc., with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.
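The zero-convolution idea can be illustrated with a minimal sketch: a convolution whose weights and biases start at zero contributes exactly nothing at initialization, so adding its output to a frozen backbone feature leaves the pretrained model's behavior unchanged until training grows the parameters away from zero. The pure-Python toy below (a 1x1 convolution over a feature vector; all names and shapes are illustrative, not the paper's implementation) demonstrates this property.

```python
def zero_conv_1x1(x, weight, bias):
    """1x1 convolution over a feature vector: y[o] = sum_i weight[o][i] * x[i] + bias[o]."""
    return [sum(w_oi * x_i for w_oi, x_i in zip(row, x)) + b
            for row, b in zip(weight, bias)]

channels = 4
# Zero-initialized parameters, as in a "zero convolution".
weight = [[0.0] * channels for _ in range(channels)]
bias = [0.0] * channels

backbone_feat = [0.3, -1.2, 0.7, 0.05]   # frozen pretrained-branch feature (toy values)
control_feat = [1.0, 2.0, 3.0, 4.0]      # trainable control-branch feature (toy values)

# The control branch's contribution is added through the zero convolution;
# at initialization it is exactly zero, so the combined output equals the
# frozen backbone output and no noise perturbs the pretrained model.
out = [f + z for f, z in zip(backbone_feat, zero_conv_1x1(control_feat, weight, bias))]
assert out == backbone_feat
print(out)
```

During finetuning, gradients flowing through the zero convolution are nonzero (they depend on the control-branch activations), so the weights can grow away from zero and the control signal gradually takes effect.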