SSGVS: Semantic Scene Graph-to-Video Synthesis

Yuren Cong, Jinhui Yi, Bodo Rosenhahn, Michael Ying Yang

Research output: Chapter or section in a book/report/conference proceeding > Chapter in a published conference proceeding

3 Citations (SciVal)

Abstract

As a natural extension of the image synthesis task, video synthesis has attracted a lot of interest recently. Many image synthesis works utilize class labels or text as guidance. However, neither labels nor text can provide explicit temporal guidance, such as when an action starts or ends. To overcome this limitation, we introduce semantic video scene graphs as input for video synthesis, as they represent the spatial and temporal relationships between objects in the scene. Since video scene graphs are usually temporally discrete annotations, we propose a video scene graph (VSG) encoder that not only encodes the existing video scene graphs but also predicts the graph representations for unlabeled frames. The VSG encoder is pre-trained with different contrastive multi-modal losses. A semantic scene graph-to-video synthesis framework (SSGVS), based on the pre-trained VSG encoder, VQ-VAE, and auto-regressive Transformer, is proposed to synthesize a video given an initial scene image and a non-fixed number of semantic scene graphs. We evaluate SSGVS and other state-of-the-art video synthesis models on the Action Genome dataset and demonstrate the positive significance of video scene graphs in video synthesis. The source code is available at https://github.com/yrcong/SSGVS.

Original language: English
Title of host publication: Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2023
Publisher: IEEE
Pages: 2555-2565
Number of pages: 11
ISBN (Electronic): 9798350302493
Publication status: Published - 14 Aug 2023
Event: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2023 - Vancouver, Canada
Duration: 18 Jun 2023 to 22 Jun 2023

Publication series

Name: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
Volume: 2023-June
ISSN (Print): 2160-7508
ISSN (Electronic): 2160-7516

Conference

Conference: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2023
Country/Territory: Canada
City: Vancouver
Period: 18/06/23 to 22/06/23

Funding

Acknowledgements: This work was supported by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor (grant no. 01DD20003) and the AI service center KISSKI (grant no. 01IS22093C), by ZDIN, and by the DFG under Germany’s Excellence Strategy within the Cluster of Excellence PhoenixD (EXC 2122).

Funders                                        Funder number
AI service center KISSKI                       01IS22093C
ZDIN
Deutsche Forschungsgemeinschaft                EXC 2122
Bundesministerium für Bildung und Forschung    01DD20003

ASJC Scopus subject areas

• Computer Vision and Pattern Recognition
• Electrical and Electronic Engineering
