GAN Photo Editing - Tarang Shah

16726 - Learning Based Image Synthesis - HW5


Andrew Id: tarangs

Part 1 - Inverting the Generator

Background

The core idea here is to invert an image generator. Before we get into the details of the inversion and how we do it, let's first understand what the generator does.

An image generator usually takes a vector as input and generates an image. Essentially it is a black box which takes a vector and returns an image.

Vector → [Generator] → Image

We can see the visualization of the generator model from our "Vanilla GAN" of Homework 3 below

Generator from the Vanilla GAN

Although we show the Generator from a simple GAN, it is possible to use any generator. For the purposes of this assignment, we use 2 generators: the Vanilla GAN generator above and the widely used StyleGAN2 generator.

Describing the task

Now that we have seen what a generator does, let's talk about our task. Our first task is to recover a vector from a given input image. This is literally the opposite of what the generator does 🙃.

The vector we want from a given image is also known as a latent vector, since it belongs to the "latent space" of the generator.

Given an input image, the goal is to find a latent vector that produces the input image when we pass it through the generator.

Input Image → [ ?? ] → Latent Vector → [Generator] → Generated Image

Our goal is to figure out the "??" in the above process, such that the Generated Image is as similar to the input image as possible. Since the process is the reverse of what the Generator does, we call this "inverting" the generator.
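
In other words, the inversion can be written as an optimization over the latent vector, with $G$ the frozen generator, $I_{input}$ the input image, and $\mathcal{L}$ the reconstruction loss described in the next section:

$z^{*} = \arg\min_{z} \; \mathcal{L}\left(G(z),\ I_{input}\right)$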

Doing the task

We use optimization techniques to achieve this inversion. We don't actually train a model to replace the "??" in the process above; instead, we directly optimize a latent vector to get the result we want.

Steps followed

  1. We start with a random latent vector and pass it through the Generator (we enable requires_grad on the latent vector so that it can be optimized).
  2. The Generator is kept in eval mode, so it is only used for forward passes.
  3. Since we want the resultant image from the Generator to be as close to the real image as possible, we need to build a loss function for this:
    1. We use a weighted combination of a simple Mean-Squared-Error loss and a Perceptual Loss, as mentioned in this paper:
      1. $Metric = (1-\lambda)\times mse\_loss + \lambda \times perceptual\_loss$ (here $\lambda$ is also called the perceptual weight or perc_wgt)
      2. The Perceptual Loss here is the "Content Loss" at conv_4 of a VGG network, as described here
    2. The loss is calculated between the resultant image and the input real image:
      1. $Loss = \sqrt{[Metric(RealImage)-Metric(GeneratedImage)]^2}$
  4. We use an LBFGS optimizer on this loss to optimize the input latent vector.
  5. Finally, after about 1000-2000 iterations, we use the resultant vector as the optimized latent vector (a minimal sketch of this loop is shown below).
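
A minimal sketch of this optimization loop is shown below. It assumes a PyTorch `generator` that maps a latent vector to an image; the VGG cut-off used for "conv_4", the `z_dim`, and the default step count are illustrative assumptions rather than the exact assignment code.

```python
import torch
import torch.nn.functional as F
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen VGG feature extractor for the perceptual ("content") loss.
# The slice index below is an assumption for "conv_4"; pick the layer you actually use.
# (A real implementation would also normalize images to VGG's expected statistics.)
vgg_features = models.vgg19(pretrained=True).features[:8].to(device).eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def criterion(generated, target, perc_wgt=0.002):
    """Weighted combination of MSE and perceptual loss."""
    mse_loss = F.mse_loss(generated, target)
    perc_loss = F.mse_loss(vgg_features(generated), vgg_features(target))
    return (1 - perc_wgt) * mse_loss + perc_wgt * perc_loss

def invert(generator, target, z_dim=128, n_steps=1000, perc_wgt=0.002):
    """Optimize a random latent so that generator(z) matches the target image."""
    generator.eval()                                              # forward passes only
    z = torch.randn(1, z_dim, device=device, requires_grad=True)  # latent being optimized
    optimizer = torch.optim.LBFGS([z])

    for _ in range(n_steps):
        def closure():
            optimizer.zero_grad()
            loss = criterion(generator(z), target, perc_wgt)
            loss.backward()
            return loss
        optimizer.step(closure)

    return z.detach()
```

Note that LBFGS evaluates the closure several times per `step`, so each outer iteration already performs multiple forward/backward passes through the generator and VGG.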

Experiments and Results

For the latent vector generation and the image generation, we use 2 models. For each model, we experiment with the following variations of the latent vector:

  1. Vanilla GAN
    1. A simple random vector - $z$
  2. StyleGAN2
    1. A simple random vector - $z$
    2. A single intermediate latent vector obtained from StyleGAN's internal mapping network - $w$
    3. A collection of per-layer latent vectors obtained from StyleGAN's internal mapping network - $w+$ (this is technically a latent "tensor", but we refer to it as a latent vector for brevity)
      We can see the mapping network in the above image, which maps a random vector $z$ to the $w$ space.
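
As a rough sketch of how the three latent types relate, the snippet below shows one way to go from $z$ to $w$ and $w+$; the `mapping_network`, `num_layers`, and tensor shapes are assumptions about a typical StyleGAN2 port, not the exact interface used here.

```python
import torch

def sample_latents(mapping_network, num_layers, z_dim=512):
    """Illustrates how z, w, and w+ relate (interface names/shapes are assumptions)."""
    z = torch.randn(1, z_dim)                         # z: random Gaussian latent
    w = mapping_network(z)                            # w: intermediate latent from the mapping network
    w_plus = w.unsqueeze(1).repeat(1, num_layers, 1)  # w+: one copy of w per synthesis layer
    return z, w, w_plus
```

In the $w+$ case, each per-layer copy is then optimized independently, which is what gives $w+$ its extra flexibility.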

For both the above models, we run the following experiments as part of Task 1

  1. A simple forward pass on randomly generated $z$, $w$, and $w+$ latent vectors
  2. Given an input image, find a $z$, $w$, or $w+$ in the latent space that yields the closest possible image from the generator.
    1. Additionally, we experiment with different weightings of the perceptual and MSE losses.
    2. We compare the two models, and for StyleGAN we also compare the results obtained with $z$, $w$, and $w+$.

Show some example outputs of your image reconstruction efforts using (1) various combinations of the losses, (2) different generative models, and (3) different latent space (latent code, w space, and w+ space). Give comments on why the various outputs look how they do. Which combination gives you the best result and how fast your method performs.

Results on randomly sampling the Vectors

Vanilla GAN - Sampling Z
StyleGAN2 - Sampling Z
StyleGAN2 - Sampling W
StyleGAN2 - Sampling W+

Since these are randomly sampled vectors, we can only compare overall quality: the StyleGAN results are much better than the Vanilla GAN ones. This is expected, as StyleGAN2 is a much larger model and was originally trained on higher-quality images, making it easier for it to generate realistic samples.

For the 2 projection experiments below, on Vanilla GAN and StyleGAN, we use the following original image.

Results on Generating a Latent vector with Vanilla GAN

For the Vanilla GAN, we only have the option of optimizing the latent vector $\textbf{z}$.

We chose a perc_wgt ($\lambda$) of 0.002.

Results on StyleGAN

Base Image

Reconstructed Image from Latent vectors (we show the 1500th or 2000th iteration for each of the images below)

| perc_wgt ↓ / latent vector → | $\textbf{z}$ (2000 it) | $\textbf{w}$ (1500 it) | $\textbf{w+}$ (2000 it) |
| --- | --- | --- | --- |
| 0.002 | | | |
| 0.5 | | | |
| 0.9 | | | |


The first observation from the random samples holds here as well: StyleGAN gives better results than Vanilla GAN. This is due to StyleGAN being a much better designed and larger model.

We can also see that perc_wgt = 0.002 gives the best results overall, especially in terms of image clarity and reconstruction similarity (qualitatively observed). This was expected, since the original paper also suggests this setting. Within perc_wgt = 0.002, $\textbf{w+}$ gives the best output, with $\textbf{w}$ a close second, followed by the image generated from the $\textbf{z}$ vector.

Hence we choose either $\textbf{w}$ or $\textbf{w+}$ for the next tasks. We also use perc_wgt = 0.002 for the experiments below.

For one of the Bells and Whistles, I ran the 256x256 version of the same image and generated the StyleGAN reconstructed images (Vanilla GAN is not trained on high-res images, hence no results are shown for that model).

Original Image
w+ Reconstructed
w reconstructed
z reconstructed

Here too we can observe that w+ gives superior results.

Part 2 - Image Interpolation using GANs

Background and Task Description

For this task, we want to interpolate between 2 images. Naively interpolating between two images just generates a simple fade transition; here, we instead want the interpolation to preserve high-level context.

Naive Interpolation Example

Naive interpolation essentially takes Image 1 and Image 2 and directly interpolates the pixel values.

$I_t = t \cdot I_1 + (1-t) \cdot I_2$

Here, $t$ is the timestamp of the intermediate frame. We scale $t$ to ensure $0 < t < 1$, where 1 represents the maximum time.

Video generated using CSS cross fade example here

As we can see, this works fine, but the intermediate frames are essentially just a noisy sum of the 2 images.
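
For reference, this naive cross-fade is just a per-pixel linear blend between the two frames; a toy sketch (assuming both images are equally sized float tensors) is:

```python
import torch

def crossfade_frames(img1, img2, n_frames=30):
    """Naive interpolation: per-pixel blend I_t = t*I_1 + (1-t)*I_2."""
    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)                    # t goes from 0 to 1
        frames.append(t * img1 + (1 - t) * img2)  # blend pixel values directly
    return frames
```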

Using GANs, we can get a much more intuitive interpolation and even interpolate specific aspects of the face, resulting in much smoother and more natural images.

We use the core idea discussed above to generate a latent vector for both images. Since these vectors live in the same latent space, it is possible to interpolate between the 2 images by simply interpolating the latent vectors.

Interpolating the latent vectors instead of the image gives much better results.

$z_t = t \cdot z_1 + (1-t) \cdot z_2$
$I_t = G(z_t)$

Here, $z_t$ = interpolated latent vector, $I_t$ = interpolated image, $G$ = generator, and $z_1$, $z_2$ = latent vectors of Image 1 and Image 2.
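
A minimal sketch of this latent-space interpolation is shown below; the `generator` call signature and frame count are assumptions, and $z_1$, $z_2$ are the latents recovered by the inversion procedure from Part 1.

```python
import torch

@torch.no_grad()
def interpolate_latents(generator, z1, z2, n_frames=30):
    """Decode frames from linearly interpolated latents: I_t = G(t*z1 + (1-t)*z2)."""
    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)
        z_t = t * z1 + (1 - t) * z2     # interpolate in latent space
        frames.append(generator(z_t))   # decode each interpolated latent
    return frames
```

The same function works for $w$ and $w+$ latents, as long as `generator` accepts that latent type.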

Results and Experiments

Using StyleGAN

Show a few interpolations between grumpy cats. Comment on the quality of the images between the cats and how the interpolation proceeds visually.

Using W

Original Image 1

Generated Image 1

Original Image 2

Generated Image 2

Blended GIF(looping)

Using W+

Original Image 1

Generated Image 1

Original Image 2

Generated Image 2

Blended GIF(looping)

Based on the w and w+ results, we can see that both the generated images and the interpolation are much smoother for w+. The w interpolation seems smooth but involves a warping that makes it look unrealistic.

Bells and Whistles on High Res Image

Here is an example of the high-res w+ interpolation case.

Original Image 1

Generated Image 1

Original Image 2

Generated Image 2

Blended GIF

In this case, the input images are quite challenging, especially since they contain sunlight and clothing, which are not seen in the majority of the grumpy cat images. Despite this, the StyleGAN w+ reconstruction is able to replicate the cat's expression and face direction. Once we have the reconstructed images (and hence the latent vectors), we can use the interpolation function described above. The generated GIF is also quite smooth and accurate.

Part 3 - Scribble to Image

Background

In this section, instead of optimizing against an input image, we use an input sketch (which can be a simple hand drawing) to generate an image using the generator.

The sketch serves as a "Soft" constraint for our generator. We do the same process of optimization on the latent vector to ensure the generated image matches the sketch.

We also apply a Mask to ensure that only the pixels where we have the sketch are used for optimization.

We can apply the mask either to the input image itself or to the feature map generated by VGG.

For the current version, we apply the mask on the input image.
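
A sketch of the masked reconstruction term under this image-space setup is below; the helper name is hypothetical, and the perceptual term from Part 1 can be masked and added in the same way.

```python
import torch

def masked_mse(generated, sketch, mask):
    """MSE restricted to the sketched pixels.

    `mask` is 1 where the sketch has content and 0 elsewhere; normalizing by the
    mask sum keeps the loss scale comparable between sparse and dense sketches.
    """
    diff = (generated - sketch) * mask
    return diff.pow(2).sum() / mask.sum().clamp(min=1)
```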

Results

Draw some cats and see what your model can come up with! Experiment with sparser and denser sketches and the use of color. Show us a handful of example outputs along with your commentary on what seems to have happened and why.

Sketch

StyleGAN Reconstructed (using w+) ~1500/1250 iters

Failure Cases (2000+iterations)

Discussion

At around 1500 iterations, we can see that we are able to replicate the sketch pretty accurately. The last 2 custom sketches also work quite nicely.

As we increase the number of iterations, we see that at 2000 iterations the image for sparse sketches "blows up" to simply match the sketch colors, with no distinct features. For dense sketches, since the mask is also dense, the generated image follows the sketch closely rather than just replicating its colors.

High-res results

Sketch

Generated(StyleGAN, w+)

Failures(2000+ iter)

We see that the high-res results are more detailed and also replicate the sketch somewhat more accurately. This is most likely due to the larger number of pixels available both inside and outside the mask.

Bells and Whistles

I used the high-res models and data for the above tasks; the results and discussions are included inline with the relevant tasks.