Abstract
This thesis addresses the task of photo-realistic semantic image editing, where the goal is to provide intuitive controls for modifying the content of an image such that the result is indistinguishable from a real image. The focus is on editing human faces, although the proposed models can be readily applied to other types of images. We build on recently proposed deep generative models, which allow image editing operations to be learned from data. These models have a number of limitations, two of which are explored in this thesis: the difficulty of modelling high-frequency image details, and the inability to edit images at arbitrarily high resolutions.
The difficulty of modelling high-frequency image details is typical of methods with explicit likelihoods. This work presents a novel approach to overcome this problem by relaxing the common assumption that the pixels in the image noise distribution are independent. In most scenarios, breaking away from this independence assumption leads to a significant increase in computational cost. It also makes the distribution harder to estimate, because the number of parameters to be estimated grows considerably. To overcome these obstacles, we present a tractable approach for a correlated multivariate Gaussian data likelihood, based on sparse inverse covariance matrices. This approach is demonstrated on variational autoencoder (VAE) networks.
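To illustrate why a sparse inverse covariance keeps a correlated Gaussian likelihood tractable, here is a minimal NumPy sketch. It assumes the precision matrix is parametrised through a sparse lower-triangular Cholesky factor `L` (so the precision is `L @ L.T`), with only the diagonal and first sub-diagonal non-zero. That sparsity pattern, the function names, and the dense storage are illustrative assumptions, not the parametrisation used in the thesis.

```python
import numpy as np


def banded_cholesky_factor(log_diag, off_diag):
    """Build a lower-bidiagonal factor L (hypothetical sparsity pattern).

    The diagonal is exp(log_diag), so it is strictly positive and the
    precision matrix L @ L.T is guaranteed to be positive definite.
    Stored densely here for readability only.
    """
    d = log_diag.shape[0]
    L = np.zeros((d, d))
    L[np.arange(d), np.arange(d)] = np.exp(log_diag)
    L[np.arange(1, d), np.arange(d - 1)] = off_diag
    return L


def gaussian_log_likelihood(x, mean, L):
    """Log-density of N(x | mean, (L L^T)^{-1}).

    Uses log det(precision) = 2 * sum(log diag(L)) and
    r^T (L L^T) r = ||L^T r||^2, so no matrix inverse or dense
    determinant is needed.
    """
    d = x.shape[0]
    r = x - mean
    z = L.T @ r
    log_det_precision = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * log_det_precision - 0.5 * (z @ z) - 0.5 * d * np.log(2.0 * np.pi)
```

With this pattern the factor has 2d - 1 free parameters for d pixels, compared with d(d + 1)/2 for an unrestricted covariance, and both terms of the likelihood cost time proportional to the number of non-zeros when `L` is stored in a sparse format.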
An approach for performing image edits at arbitrarily high resolutions using generative adversarial networks (GANs) is also proposed. The method relies on restricting the types of edits to smooth warps, i.e. geometric deformations of the input image. These warps can be efficiently learned and predicted at a lower resolution, and easily upsampled to be applied at arbitrary resolutions with minimal loss of fidelity. Moreover, the method does not need paired data for training, i.e. example images of the same subject with different semantic attributes. The model offers several advantages over previous approaches that directly predict pixel values: the edits are more interpretable, the image content is better preserved, and partial edits can be easily applied.
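The idea of predicting a warp at low resolution and applying it at any output size can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the displacement field is hand-supplied rather than predicted by a GAN, it is expressed as fractions of image height and width so that it is resolution independent, and bilinear interpolation is used for both upsampling and resampling. All function names are hypothetical.

```python
import numpy as np


def bilinear_sample(img, ys, xs):
    """Sample img (H, W) or (H, W, C) at float pixel coordinates ys, xs."""
    H, W = img.shape[:2]
    ys = np.clip(ys, 0, H - 1)  # clamp out-of-range coordinates to the border
    xs = np.clip(xs, 0, W - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, H - 1)
    x1 = np.minimum(x0 + 1, W - 1)
    wy = ys - y0
    wx = xs - x0
    if img.ndim == 3:  # broadcast the weights over the channel axis
        wy = wy[..., None]
        wx = wx[..., None]
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bottom = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def upsample_field(field, out_h, out_w):
    """Bilinearly resize a (h, w, 2) displacement field to (out_h, out_w, 2)."""
    h, w = field.shape[:2]
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")
    return bilinear_sample(field, grid_y, grid_x)


def apply_warp(image, low_res_field):
    """Warp an image of any size with a low-resolution displacement field.

    low_res_field[..., 0] and [..., 1] hold vertical and horizontal offsets
    as fractions of (H - 1) and (W - 1), i.e. in normalised units.
    """
    H, W = image.shape[:2]
    field = upsample_field(low_res_field, H, W)
    grid_y, grid_x = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_y = grid_y + field[..., 0] * (H - 1)
    src_x = grid_x + field[..., 1] * (W - 1)
    return bilinear_sample(image, src_y, src_x)
```

Because the field is smooth and stored in normalised units, the same small array can be passed to `apply_warp` together with an image of any resolution. Scaling the field by a factor between 0 and 1 before applying it gives a partial edit.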
|Date of Award|24 Jun 2020|
|Sponsors|Anthropics Technology Ltd|
|Supervisor|Yongliang Yang (Supervisor), Neill Campbell (Supervisor), Ivor Simpson (Supervisor) & Sara Vicente (Supervisor)|
- machine learning
- deep neural networks
- image editing