
Guiding a Diffusion Model with a Bad Version of Itself

Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, Samuli Laine
Abstract

The primary axes of interest in image-generating diffusion models are image quality, the amount of variation in the results, and how well the results align with a given condition, e.g., a class label or a text prompt. The popular classifier-free guidance approach uses an unconditional model to guide a conditional model, leading to simultaneously better prompt alignment and higher-quality images at the cost of reduced variation. These effects seem inherently entangled, and thus hard to control. We make the surprising observation that it is possible to obtain disentangled control over image quality without compromising the amount of variation by guiding generation using a smaller, less-trained version of the model itself rather than an unconditional model. This leads to significant improvements in ImageNet generation, setting record FIDs of 1.01 for 64x64 and 1.25 for 512x512, using publicly available networks. Furthermore, the method is also applicable to unconditional diffusion models, drastically improving their quality.
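For orientation, the sketch below shows how this kind of guidance is typically applied at a single denoising step: the main model's prediction is linearly extrapolated away from a guiding model's prediction. In classifier-free guidance the guiding model is the unconditional one; in the approach described here it would be a smaller, less-trained version of the main model. This is a minimal illustration, not the paper's implementation; the names (guided_denoise, D_main, D_guide, w) are assumptions made for the example.

```python
import torch

def guided_denoise(D_main, D_guide, x, sigma, label=None, w=2.0):
    """Linear-extrapolation guidance for one denoising step (illustrative).

    D_main  -- the full conditional denoiser being sampled from
    D_guide -- the guiding denoiser: the unconditional model in
               classifier-free guidance, or a smaller / less-trained
               version of D_main in the self-guidance setting above
    w       -- guidance weight; w = 1 disables guidance
    """
    d_main = D_main(x, sigma, label)
    d_guide = D_guide(x, sigma, label)
    # Equivalent to w * d_main + (1 - w) * d_guide.
    return d_guide + w * (d_main - d_guide)

# Toy usage with stand-in denoisers, just to show the call pattern:
if __name__ == "__main__":
    x = torch.randn(1, 3, 64, 64)
    D_main = lambda x, sigma, label: 0.9 * x
    D_guide = lambda x, sigma, label: 0.7 * x
    out = guided_denoise(D_main, D_guide, x, sigma=1.0, w=2.0)
    print(out.shape)  # torch.Size([1, 3, 64, 64])
```

With w > 1, the sample is pushed toward regions that the main model rates as more likely than the guiding model does; setting w = 1 recovers ordinary sampling from the main model alone.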