CMU 16-726: Learning Based Image Synthesis


GAN Photo Editing

Maneesh Bilalpur(mbilalpu)


Overview

In this assignment, we use existing GAN models to perform image editing. The goals are to sample the latent vector corresponding to a given input image and explore its neighbourhood (the interpolation task), and to synthesize images based on input scribbles. We use DCGAN and styleGAN implementations as the trained models. The dataset is a grumpy cat dataset; some sample images are shown below.

Dataset representative images

Criterion

We optimize an MSE loss for pixel-level similarity and add a perceptual loss; the total loss is a convex combination of the two. We restrict the perceptual loss to the conv1 embeddings from the VGG19 network. We experiment with the contribution of the perceptual loss (the lambda_perceptual parameter) to the total loss and evaluate how it affects the synthesized images.
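A minimal PyTorch sketch of this criterion is below; the VGG19 slice, the variable names, and the exact convex weighting are illustrative assumptions rather than the reference implementation.

```python
import torch.nn.functional as F
from torchvision import models

# Conv1 feature extractor: the first conv layer (plus ReLU) of a
# pretrained VGG19, frozen so gradients flow only into the latent.
vgg_conv1 = models.vgg19(pretrained=True).features[:2].eval()
for p in vgg_conv1.parameters():
    p.requires_grad_(False)

def criterion(fake, real, lambda_perceptual=0.1):
    mse = F.mse_loss(fake, real)                         # pixel similarity
    perc = F.mse_loss(vgg_conv1(fake), vgg_conv1(real))  # conv1 embedding distance
    # Convex combination: lambda_perceptual = 0 is pure MSE, 1 is pure perceptual.
    return (1 - lambda_perceptual) * mse + lambda_perceptual * perc
```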

Sample noise

We optimize for the sample noise (latent) that satisfies a given condition: either the GAN output from this latent should correspond to a given image (the sampling and interpolation tasks), or the output should correspond to a realistic image that resembles a scribble/sketch (the draw task). Different architectures use different versions of the latent noise: DCGAN uses an N-dimensional sample z from a Gaussian distribution, while styleGAN can also use an embedding (w or w+) of this noise as the input. The nature of the optimization therefore changes with the choice of sample noise. We use the LBFGS solver, a quasi-Newton method that approximates second-order information.
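A minimal sketch of the projection loop for the z case, assuming a trained generator G and reusing the criterion above; the names, latent dimension, and step count are illustrative. For styleGAN in w or w+, the optimized tensor would instead be fed past the mapping network, but the loop is otherwise the same.

```python
import torch

def project(G, target, num_steps=1000, latent_dim=100):
    # z ~ N(0, I); this tensor is the only thing being optimized.
    latent = torch.randn(1, latent_dim, requires_grad=True)
    optimizer = torch.optim.LBFGS([latent])

    def closure():
        # LBFGS re-evaluates the loss several times per step.
        optimizer.zero_grad()
        loss = criterion(G(latent), target)
        loss.backward()
        return loss

    for _ in range(num_steps):
        optimizer.step(closure)
    return latent.detach()
```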

Project

In this problem, we evaluate how well a given image can be reproduced by a GAN model by optimizing for the latent vector corresponding to the input image. We use the loss defined in the Criterion section and study how the contribution of the perceptual loss and the nature of the latent vector affect the generated image.

How the perceptual loss affects the output image:

Left-to-right: input, then images generated with lambda_perceptual = {0, 0.1, 1}. All settings were run for 1000 iterations with optimization in w.

We observe the best outcome with no perceptual loss. While all settings produce decent cat images, relying solely on the perceptual loss (right image) yields an unnatural image (notice the cat's eyes). During the optimization we notice that the overall cat structure and the background are maintained from the initial iterations, and the process mainly adjusts fine-grained details like the whiskers and the patch structure around the eyes.

How the choice of network affects the output image:

Left-to-right: input, then images generated with DCGAN and styleGAN at lambda_perceptual = 0. All settings were run for 1000 iterations with the latent z noise.

We notice that the DCGAN outputs have aliasing-like artifacts that make the right edge of the image look fuzzy; this effect is not observed with styleGAN. Optimization with the DCGAN model is about 20x-30x faster than with styleGAN: the run time for 1000 iterations with DCGAN was about one minute.

Left-to-right: effect of the perceptual loss on DCGAN at lambda_perceptual = {0, 0.1, 1}. All settings were run for 1000 iterations.

We notice that the DCGAN model has artifacts similar to styleGAN's when the perceptual loss is used: increasing the contribution of the perceptual term introduces deformities.

How the choice of sample noise affects the output image:

Left-to-right: input, then images generated with sample noise = {z, w, w+} at lambda_perceptual = 0. All settings were run for 1000 iterations.

With z noise the background is green, while with w/w+ the background is similar to the input's. Note that with all three noise vectors the head pose is not aligned with the input image.

Comprehensive progress of training

Top-to-bottom: input image, then outputs using styleGAN with {z, w, w+}. Left-to-right: images generated every 200 iterations, up to 1000 iterations, at lambda_perceptual = 0.

Interpolation

In this problem we project the given images into the latent noise space, linearly interpolate between the recovered latents, and decode the interpolants through the generator.
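A minimal sketch of the interpolation, reusing the hypothetical project helper from above; n controls how many in-between frames are produced.

```python
import torch

def interpolate(G, img_a, img_b, n=5):
    z_a, z_b = project(G, img_a), project(G, img_b)
    frames = []
    for t in torch.linspace(0.0, 1.0, n):
        z_t = (1 - t) * z_a + t * z_b  # linear blend in latent space
        frames.append(G(z_t))
    return frames
```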

Each row, left-to-right: first input image, interpolations in z space, w space, and w+ space, and the second input image.

We see that in z space the head pose is captured very well during interpolation, but the cat images at the ends are fuzzy and lose detail around the eyes (first row). With w/w+, the challenge is that the head pose is hard to capture in the first row, though it is relatively better in the second row.

Scribble to Image

In this problem we attempt to generate a realistic image that corresponds to the input scribble, constraining the output only where the scribble (via its mask) provides color information.
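A minimal sketch of the masked objective, assuming a binary mask m derived from the scribble (1 where the user drew, 0 elsewhere); normalizing by the mask area is an assumption here, to keep sparse and dense scribbles comparable.

```python
def masked_mse(fake, scribble, m):
    # Penalize the generated image only at the scribbled pixels,
    # leaving the generator free everywhere else.
    return ((m * (fake - scribble)) ** 2).sum() / m.sum().clamp(min=1)
```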

Top-to-bottom: different scribbles used for synthesizing images.
Left-to-right: input scribble, mask from the scribble, and output images from styleGAN using z, w, and w+ noise.

A standout feature is that the w+ images have trouble with background synthesis. We also notice that z/w produce cats with blue eyes despite the scribble having black eyes (see row 3 above); this is not the case with w+, where the eye color is closer to the scribble. Another interesting comparison is the cat ears: the input ears are best captured in w+ space, followed by w space; z space does capture the ears but is not as dependable as w/w+. In z space, sparse scribbles are synthesized very well, but dense scribbles cause artifacts and color discrepancies (see the last row). Barring the background trouble, w/w+ produce the better outputs.


Original assignment website here.
Website template copied from here.