Text to Image Generation with Semantic-Spatial Aware GAN

Wentong Liao, Kai Hu, Michael Ying Yang, Bodo Rosenhahn

Research output: Chapter or section in a book/report/conference proceeding (Chapter in a published conference proceeding)

104 Citations (SciVal)
95 Downloads (Pure)

Abstract

Text-to-image synthesis (T2I) aims to generate photorealistic images which are semantically consistent with the text descriptions. Existing methods are usually built upon conditional generative adversarial networks (GANs) and initialize an image from noise with sentence embedding, and then refine the features with fine-grained word embedding iteratively. A close inspection of their generated images reveals a major limitation: even though the generated image holistically matches the description, individual image regions or parts of objects are often not recognizable or consistent with words in the sentence, e.g. 'a white crown'. To address this problem, we propose a novel framework, Semantic-Spatial Aware GAN, for synthesizing images from input text. Concretely, we introduce a simple and effective Semantic-Spatial Aware block, which (1) learns a semantic-adaptive transformation conditioned on text to effectively fuse text features and image features, and (2) learns a semantic mask in a weakly-supervised way that depends on the current text-image fusion process in order to guide the transformation spatially. Experiments on the challenging COCO and CUB bird datasets demonstrate the advantage of our method over the recent state-of-the-art approaches, regarding both visual fidelity and alignment with the input text description. Code is available at https://github.com/wtliao/text2image.
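The Semantic-Spatial Aware block described in the abstract can be pictured as a text-conditioned modulation of image features, gated by a learned spatial mask. Below is a minimal PyTorch sketch of that idea; the layer choices, dimensions, and the exact fusion rule are assumptions for illustration, not the authors' implementation (see https://github.com/wtliao/text2image for the released code).

```python
# Minimal sketch of a semantic-spatial aware block: text-conditioned affine
# modulation of image features, applied only where a learned mask is active.
# All sizes and layer choices are illustrative assumptions.
import torch
import torch.nn as nn


class SemanticSpatialAwareBlock(nn.Module):
    def __init__(self, num_channels: int, text_dim: int):
        super().__init__()
        # Semantic-adaptive transformation: per-channel scale and shift
        # predicted from the sentence embedding.
        self.gamma = nn.Linear(text_dim, num_channels)
        self.beta = nn.Linear(text_dim, num_channels)
        # Spatial mask predictor over the current image features; in the full
        # model it would be trained without mask labels (weak supervision via
        # the image-text matching and adversarial losses).
        self.mask_conv = nn.Conv2d(num_channels, 1, kernel_size=1)
        self.norm = nn.BatchNorm2d(num_channels, affine=False)

    def forward(self, img_feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W) image features from the previous stage
        # text_emb: (B, D) sentence embedding of the input description
        normalized = self.norm(img_feat)
        gamma = self.gamma(text_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.beta(text_emb).unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1)
        # Predict where in the image the text condition should act.
        mask = torch.sigmoid(self.mask_conv(img_feat))            # (B, 1, H, W)
        # Apply the text-conditioned affine transform where the mask is high,
        # keep the original features elsewhere.
        modulated = gamma * normalized + beta
        return mask * modulated + (1.0 - mask) * img_feat


if __name__ == "__main__":
    block = SemanticSpatialAwareBlock(num_channels=64, text_dim=256)
    img = torch.randn(2, 64, 32, 32)
    txt = torch.randn(2, 256)
    print(block(img, txt).shape)  # torch.Size([2, 64, 32, 32])
```

In this reading, the block only defines a forward pass; the mask receives no annotations and is shaped indirectly by the losses of the surrounding GAN, which is what the abstract means by learning the semantic mask in a weakly-supervised way.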

Original language: English
Title of host publication: Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Publisher: IEEE
Pages: 18166-18175
Number of pages: 10
ISBN (Electronic): 9781665469463
DOIs
Publication status: Published - 27 Sept 2022
Event: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 - New Orleans, United States
Duration: 19 Jun 2022 - 24 Jun 2022

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume: 2022-June
ISSN (Print): 1063-6919

Conference

Conference: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Country/Territory: United States
City: New Orleans
Period: 19/06/22 - 24/06/22

Funding

This work has been supported by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor (grant no. 01DD20003), the Center for Digital Innovations (ZDIN) and the Deutsche Forschungsgemeinschaft (DFG) under Germany’s Excellence Strategy within the Cluster of Excellence PhoenixD (EXC 2122).

Funders and funder numbers:
• Center for Digital Innovations (ZDIN)
• Deutsche Forschungsgemeinschaft: EXC 2122
• Bundesministerium für Bildung und Forschung: 01DD20003

Keywords

• Image and video synthesis and generation
• Vision + language

ASJC Scopus subject areas

• Software
• Computer Vision and Pattern Recognition
