
Taming Transformers for High-Resolution Image Synthesis

Patrick Esser, Robin Rombach, Björn Ommer
Abstract

Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers and obtain the state of the art among autoregressive models on class-conditional ImageNet. Code and pretrained models can be found at https://github.com/CompVis/taming-transformers.
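The core of stage (i) is vector quantization: each CNN encoder feature is replaced by the nearest entry of a learned codebook, so an image becomes a short sequence of discrete indices that a transformer can then model autoregressively. The following is a minimal NumPy sketch of that nearest-codebook lookup, not the authors' implementation; the function name `quantize` and the toy shapes are illustrative assumptions.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    features: (N, D) array of encoder outputs, one per spatial position
    codebook: (K, D) array of learned code vectors
    returns:  (N,) integer indices into the codebook
    """
    # Squared Euclidean distance between every feature and every code entry
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Toy example: build 4 features as slightly perturbed copies of codes 2, 0, 2, 1
rng = np.random.default_rng(0)
codebook = rng.normal(size=(3, 8))
features = codebook[[2, 0, 2, 1]] + 0.01 * rng.normal(size=(4, 8))
indices = quantize(features, codebook)
print(indices.tolist())  # recovers the construction: [2, 0, 2, 1]
```

In the actual two-stage pipeline, the resulting index sequence is what the transformer of stage (ii) learns to predict token by token, optionally conditioned on class labels or segmentation maps.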
