Object-Centric Image Generation from Layouts

Despite recent impressive results on single-object and single-domain image generation, the generation of complex scenes with multiple objects remains challenging. In this paper, we start with the idea that a model must be able to understand individual objects and relationships between objects in order to generate complex scenes well. Our layout-to-image generation method, which we call Object-Centric Generative Adversarial Network (or OC-GAN), relies on a novel Scene-Graph Similarity Module (SGSM). The SGSM learns representations of the spatial relationships between objects in the scene, leading to our model's improved layout fidelity. We also propose changes to the conditioning mechanism of the generator that improve its awareness of individual object instances. Apart from improving image quality, our contributions mitigate two failure modes in previous approaches: (1) spurious objects being generated without corresponding bounding boxes in the layout, and (2) overlapping bounding boxes in the layout leading to merged objects in images. Extensive quantitative evaluation and ablation studies demonstrate the impact of our contributions, with our model outperforming previous state-of-the-art approaches on both the COCO-Stuff and Visual Genome datasets. Finally, we address an important limitation of evaluation metrics used in previous works by introducing SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric that is better suited for multi-object images.
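
To make the idea behind an object-centric FID concrete, here is a minimal sketch, not the paper's released implementation. It assumes that per-object crops are taken from the layout's ground-truth bounding boxes and that a user-supplied `feature_fn` (hypothetical here, e.g. a resize-then-Inception-v3-pool3 pipeline) maps crops to activation vectors; the Fréchet distance formula itself is the standard one used by FID.

```python
# Sketch of an object-centric FID: compute the standard Frechet distance
# over per-object crops (defined by layout boxes) rather than whole images.
import numpy as np
from scipy import linalg


def crop_objects(image, boxes):
    """Crop each bounding box (x0, y0, x1, y1), in pixel coords, from an HxWxC image."""
    return [image[y0:y1, x0:x1] for (x0, y0, x1, y1) in boxes]


def frechet_distance(feats_real, feats_fake):
    """Standard FID between two N x D arrays of Inception activations:
    ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^{1/2})."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))


def object_centric_fid(real_images, fake_images, layouts, feature_fn):
    """FID over object crops instead of full images.

    real_images, fake_images: lists of HxWxC uint8 arrays, paired by layout.
    layouts: one list of bounding boxes per image.
    feature_fn: assumed helper mapping a list of crops to an N x D
                activation array (e.g. via an Inception-v3 feature extractor).
    """
    real_crops, fake_crops = [], []
    for real, fake, boxes in zip(real_images, fake_images, layouts):
        real_crops += crop_objects(real, boxes)
        fake_crops += crop_objects(fake, boxes)
    return frechet_distance(feature_fn(real_crops), feature_fn(fake_crops))
```

Because the statistics are computed over individual object crops, a spurious or merged object degrades the score directly, whereas whole-image FID can average such local failures away in a cluttered multi-object scene.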