GAN Photo Editing - Tarang Shah
16726 - Learning Based Image Synthesis - HW5
Andrew Id: tarangs
Table of Contents
Part 1 - Inverting the Generator
Background
The core idea here is to invert an image generator. Before we get into the details of the inversion and how we do it, let's first understand what the generator does.
An image generator usually takes a vector as input and generates an image. Essentially it is a black box which takes a vector and returns an image.
Vector → [Generator] → Image
We can see the visualization of the generator model from our "Vanilla GAN" of Homework 3 below
Although we show the Generator from a simple GAN, it is possible to use any generator. For the purposes of this assignment, we use 2 generators: the Vanilla GAN generator shown above, and the very famous and popular StyleGAN2 generator.
Describing the task
Now that we have seen what a generator does, let's talk about our task. Our first task is to generate a vector from a given input image. This is literally the opposite of what the generator does 🙃.
The vector we want from a given image is also known as a latent vector, since it belongs to the "latent space" of the generator.
Given an input image, the goal is to find a latent vector that produces the input image when we pass it through the generator.
Input Image → [ ?? ] → Latent Vector → [Generator] → Generated Image
Our goal is to figure out the "??" in the above process, such that the Generated Image is as similar to the input image as possible. Since the process is the reverse of what the Generator does, we call this "inverting" the generator.
Doing the task
We use optimization techniques to achieve this inversion. We don't actually train a model to replace the "??" in the process above; instead, we use some math and optimization to find the latent vector directly.
Steps followed
- We start with a random latent vector and pass it through the Generator (we enable `requires_grad` on the latent vector, since it is the quantity we optimize).
- The Generator is in `eval` mode, so it is only used for forward passes.
- Since we want the resultant image from the Generator to be as close to the real image as possible, we need to build a loss function for this.
    - We use a combination of a simple Mean-Squared-Error loss and a Perceptual Loss. This is a weighted combination as mentioned in this paper (the weight is also called the perceptual weight or `perc_wgt`).
    - The Perceptual Loss here is the "Content Loss" at `conv_4` of a VGG network, as described here.
    - The loss is calculated between the resultant image and the input real image.
- We use an LBFGS optimizer on this loss to optimize the input latent vector.
- Finally, after about 1000-2000 iterations, we use the resultant vector as the optimized latent vector. (A minimal sketch of this loop is shown below.)
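The following is a minimal sketch of this optimization loop, assuming a frozen PyTorch generator `G`, an input image `target`, and a `perceptual_loss` helper implementing the VGG content loss; the names, shapes, and hyperparameters are illustrative rather than the exact assignment code.

```python
import torch
import torch.nn.functional as F

def invert(G, target, perceptual_loss, perc_wgt=0.002, n_steps=1500, latent_dim=100):
    G.eval()                                            # generator stays frozen; forward passes only
    z = torch.randn(1, latent_dim, requires_grad=True)  # random starting latent, optimized in place
    optimizer = torch.optim.LBFGS([z])

    def closure():
        optimizer.zero_grad()
        fake = G(z)                                     # image generated from the current latent
        loss = F.mse_loss(fake, target) + perc_wgt * perceptual_loss(fake, target)
        loss.backward()
        return loss

    for _ in range(n_steps):
        optimizer.step(closure)                         # LBFGS re-evaluates the closure internally
    return z.detach()
```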
Experiments and Results
For the latent vector generation and the image generation, we use 2 models. For both models, we use variations of the latent vectors:
- Vanilla GAN
    - A simple random vector - z
- StyleGAN2
    - A simple random vector - z
    - A single weight vector using StyleGAN's internal `mapping` network - w
    - A collection of weight vectors using StyleGAN's internal `mapping` network - w+ (this is technically a latent "tensor", but we refer to it as a latent vector for brevity)
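As a rough sketch, the three latent types can be obtained as follows, assuming a StyleGAN2-style mapping network that maps a 512-dimensional z to a single w vector; `mapping_network` and `num_layers` are illustrative placeholders, not the actual assignment code:

```python
import torch

def sample_latents(mapping_network, latent_dim=512, num_layers=14):
    z = torch.randn(1, latent_dim)                    # z: random latent drawn from a normal distribution
    w = mapping_network(z)                            # w: single intermediate latent from the mapping network
    w_plus = w.unsqueeze(1).repeat(1, num_layers, 1)  # w+: one copy of w per synthesis layer (later optimized independently)
    return z, w, w_plus
```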
For both of the above models, we run the following experiments as part of Task 1:
- A simple forward pass on randomly generated z, w, and w+ latent vectors
- Given an input image, find a z, w, and w+ in the latent space that gives the closest possible image from the generator
- Additionally, experiment with different weights (`perc_wgt`) of the perceptual and MSE loss
- We vary between the 2 models, and for the StyleGAN model we also vary `perc_wgt` and compare the results for z, w, and w+
Results on randomly sampling the Vectors
Since these are randomly sampled vectors, we can only observe that the StyleGAN results are much better in quality compared to the VanillaGAN ones. This is expected, as StyleGAN2 is a much larger model and was originally trained on higher-quality images, making it easier for it to produce realistic samples.
For the below 2 experiments on VanillaGAN and StyleGAN projections, we use the following original image.
Results on Generating a Latent vector with Vanilla GAN
For the Vanilla GAN, we only have the option of optimizing the latent vector z.
We chose a `perc_wgt` of 0.002.
Results on StyleGAN
Base Image
Reconstructed Image from Latent vectors (we show the 1500th or 2000th iteration for each of the images below)
| Latent Vector Type ↓ \ perc_wgt → | 0.002 | 0.5 | 0.9 |
| --- | --- | --- | --- |
| z (2000 it) | (image) | (image) | (image) |
| w (1500 it) | (image) | (image) | (image) |
| w+ (2000 it) | (image) | (image) | (image) |
The first observation from the random samples holds here as well: we can see StyleGAN gives better results than VanillaGAN. This is due to StyleGAN being a much better designed and larger model.
We can also see that `perc_wgt` = 0.002 gives the best results overall, especially in terms of image clarity and reconstruction similarity (qualitatively observed). Since the original paper also mentions this, it was expected. Within `perc_wgt` = 0.002, we can see that w+ gives the best output, with w a close second, followed by the z vector generated image.
Hence we chose either w or w+ for the next tasks. We also use `perc_wgt` = 0.002 for the experiments below.
For one of the Bells and Whistles, I ran the 256x256 version of the same image and generated the StyleGAN reconstructed images. (VanillaGAN is not trained on high-res images, hence no results are shown for that model.)
Here too, we can observe that w+ gives superior results.
Part 2 - Image Interpolation using GANS
Background and Task Description
For this task, we want to interpolate between 2 images. Naively interpolating between two images just produces a simple fade transition. In this case, we instead want the interpolation to preserve high-level context.
Naive Interpolation Example
Naive interpolation essentially takes Image 1 and Image 2 and directly interpolates the pixel values:

$$I_t = (1 - t)\, I_1 + t\, I_2$$

Here, $t$ is the timestamp of the intermediate frame. We scale $t$ to ensure $0 \le t \le 1$, where 1 represents the maximum time.
As we can see, this works fine, but the intermediate frames are essentially just a noisy sum of the 2 images.
Using GANs, we can get a much more intuitive interpolation and even interpolate specific aspects of the face, resulting in a much smoother and more natural sequence of images.
We use the core idea discussed above to generate a latent vector for both images. Since both vectors live in the same latent space, it is possible to interpolate between the 2 images by simply interpolating the latent vectors.
Interpolating the latent vectors instead of the image gives much better results.
$$w_t = (1 - t)\, w_1 + t\, w_2, \qquad I_t = G(w_t)$$

Here, $w_t$ = interpolated latent vector, $I_t$ = interpolated image, $G$ = generator, and $w_1, w_2$ = latent vectors of Image 1 and 2.
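A minimal sketch of this latent-space interpolation, assuming `w1` and `w2` are the latents recovered by the projection step in Part 1 and `G` decodes a latent (w or w+) into an image; the function and argument names are illustrative:

```python
import torch

def interpolate(G, w1, w2, n_frames=30):
    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)           # t runs from 0 to 1
        w_t = (1 - t) * w1 + t * w2      # interpolate in latent space, not pixel space
        with torch.no_grad():
            frames.append(G(w_t))        # decode the interpolated latent into an image
    return frames
```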
Results and Experiments
Using StyleGAN
Using W
Original Image 1
Generated Image 1
Original Image 2
Generated Image 2
Blended GIF (looping)
Using W+
Original Image 1
Generated Image 1
Original Image 2
Generated Image 2
Blended GIF (looping)
Based on the w and w+ results, we can see that the generated images and the interpolation with w+ are much smoother. The w interpolation seems smooth but involves a warping which makes it look unrealistic.
Bells and Whistles on High Res Image
Here is an example of the high-res w+ interpolation case.
Original Image 1
Generated Image 1
Original Image 2
Generated Image 2
Blended GIF
In this case, we can see that the input images are quite challenging, especially since they contain sunlight and clothing, which are not seen in the majority of the grumpy cat images. Despite this, the StyleGAN w+ reconstruction is able to replicate the cat's expression and also the face direction. Once we have the reconstructed images (and hence the latent vectors), we can use the interpolation function described above. The generated GIF is also quite smooth and accurate.
Task 3 - Scribble to Image
Background
In this section, instead of optimizing against an input image, we use an input sketch (which includes simple hand drawings) to generate an image using the generator.
The sketch serves as a "Soft" constraint for our generator. We do the same process of optimization on the latent vector to ensure the generated image matches the sketch.
We also apply a Mask to ensure that only the pixels where we have the sketch are used for optimization.
We can either apply the mask to the input image itself, or apply it to the feature map generated by VGG.
For the current version, we apply the mask on the input image.
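A minimal sketch of this masked objective, assuming `generated` is the generator output, `sketch` and `mask` are tensors of the same spatial size, and `perceptual_loss` is the VGG content loss from Part 1; all names are illustrative:

```python
import torch.nn.functional as F

def scribble_loss(generated, sketch, mask, perceptual_loss, perc_wgt=0.002):
    # Only pixels covered by the sketch (mask == 1) contribute to the loss.
    masked_gen = generated * mask
    masked_sketch = sketch * mask
    return F.mse_loss(masked_gen, masked_sketch) + perc_wgt * perceptual_loss(masked_gen, masked_sketch)
```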
Results
Sketch
StyleGAN Reconstructed (using w+) ~1500/1250 iters
Failure Cases (2000+iterations)
Discussion
At around 1500 iterations, we can see that we are able to replicate the sketch pretty accurately. The last 2 custom sketches also work quite nicely.
As we increase the number of iterations, we see that at 2000 iterations, for sparse sketches, the image "blows up" and simply matches the sketch colors with no distinct features. When we have dense sketches, since the mask is also dense, the generated image matches the sketch closely instead of just replicating the colors.
High-res results
Sketch
Generated(StyleGAN, w+)
Failures(2000+ iter)
We see that the high-res results are more detailed and also replicate the sketch somewhat more accurately. This is most likely due to the larger number of parameters available both inside and outside the mask.
Bells and Whistles
I used the high-res models and data for the above tasks; the results and discussions are included inline with the relevant tasks.