Project 5

For this project, I used pretrained GANs (a vanilla GAN and a StyleGAN) to generate and edit pictures of grumpy cats.

Projection

Before we can do any of the following editing tasks, we need a way to represent our target images in the latent space of the provided pretrained GANs. We accomplish this by solving an optimization problem over the generator's latent space using the LBFGS optimizer. The loss combines an L1 pixel difference between the generated and target images with an MSE loss between their features, extracted with a pretrained VGG-19 network.
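To make this concrete, here is a minimal sketch of the projection loop in PyTorch. The generator handle, latent dimensionality, and step count are illustrative assumptions rather than the exact starter code, and normalizing inputs to VGG's expected statistics is omitted for brevity.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

device = "cuda" if torch.cuda.is_available() else "cpu"

# Early VGG-19 conv features serve as the perceptual representation.
vgg_features = vgg19(weights="IMAGENET1K_V1").features[:16].to(device).eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def project(generator, target, latent_dim=512, steps=250,
            l1_weight=1.0, perc_weight=1.0):
    """Optimize a latent code so that generator(latent) matches target."""
    latent = torch.randn(1, latent_dim, device=device, requires_grad=True)
    optimizer = torch.optim.LBFGS([latent], max_iter=steps)

    def closure():
        optimizer.zero_grad()
        image = generator(latent)
        loss = (l1_weight * F.l1_loss(image, target)
                + perc_weight * F.mse_loss(vgg_features(image),
                                           vgg_features(target)))
        loss.backward()
        return loss

    optimizer.step(closure)  # LBFGS re-evaluates the closure internally
    return latent.detach()
```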

Below are some reconstruction attempts using different latent space parametrizations and generator architectures.

[Image grid: Source Image | z Latent (Vanilla) | w Latent (StyleGAN) | w+ Latent (StyleGAN)]

In general, the StyleGAN results look closer to the input image, which makes sense considering that the w and w+ latent spaces have higher dimensionality than the z latent space. The vanilla GAN results also appear less diverse than StyleGAN's, which is expected given that the vanilla model is smaller. One upside of the smaller latent space and model, though, is that optimization steps for the vanilla GAN with z latents are much faster than for the other configurations. Since I felt that StyleGAN with w+ latents produced the best results, I tested that generator configuration with a few different perceptual loss weights. Because all that really matters is the ratio between the two weights, I fix the L1 loss weight at 1.
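In terms of the projection sketch above, the sweep looks something like the following (hypothetical usage; for w or w+ latents the latent shape and generator call would differ from the plain z case):

```python
# Hypothetical sweep reusing `project`, `generator`, and `target` from
# the projection sketch above; the L1 weight stays fixed at 1.
for perc_weight in (0.1, 1.0, 10.0):
    latent = project(generator, target, perc_weight=perc_weight)
```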

[Image grid: Source Image | Perceptual Weight 0.1 | Perceptual Weight 1 | Perceptual Weight 10]

One thing to note here is that when the L1 loss outweighs the perceptual loss, the projection is better at reproducing "non-cat" objects like the hat in the last picture, since the optimization prioritizes an exact pixel-level recreation of the source image. When the perceptual loss dominates, the results come closer in terms of large-scale structure, such as the head shape in the third image. The fifth image is a strange exception; I'm not entirely sure why its reconstruction failed.

Interpolation

Once we have the latent space embeddings of two target images, it's straightforward to linearly interpolate between them and pass the result through our generator. Here are a few sample interpolations made with this method, generated with StyleGAN using w+ latents and equally weighted perceptual and L1 losses.
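A minimal sketch of the interpolation, assuming `w_src` and `w_dst` are projected codes (e.g. from the projection sketch above) and the generator accepts codes in that latent space:

```python
import torch

def interpolate(generator, w_src, w_dst, num_frames=30):
    """Generate frames along the straight line between two latent codes."""
    frames = []
    with torch.no_grad():
        for t in torch.linspace(0.0, 1.0, num_frames):
            w = (1 - t) * w_src + t * w_dst  # linear blend in latent space
            frames.append(generator(w))
    return frames
```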

[Image grid: Source Image | Destination Image | Interpolation]

Overall, the interpolations are only as good as the projections of the source and destination images, so the results don't perfectly match the inputs, and there is some visible pixel noise. On the other hand, because the generator's output lies on the manifold of grumpy cat images, every frame still looks like a realistic cat. This is particularly compelling when the source and destination images face in different directions and the interpolation appears to rotate the face smoothly.

Sketch to Image

Finally, let's try generating realistic cat images from a sketch as input. To do this, I updated the loss functions to incorporate a mask so that only certain pixels are considered when comparing the target image to the generator's output. The process is otherwise very similar to the projection step, just with a more specialized loss function. Below are the results of running this algorithm on the provided sketches for 1000 iterations with various parameters.
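A sketch of how the mask might enter the loss, reusing the names from the projection sketch above; the exact mask handling here is an assumption about the approach, not the starter code:

```python
import torch.nn.functional as F

# `mask` is assumed to be a {0, 1} tensor broadcastable to the image,
# marking the sketch's drawn pixels. `vgg_features` is the truncated
# VGG-19 from the projection sketch.
def masked_loss(image, target, mask, l1_weight=1.0, perc_weight=1.0):
    # Only masked-in pixels contribute to the pixel loss.
    l1 = F.l1_loss(image * mask, target * mask)
    # Masked-out pixels are zeroed before VGG, but conv receptive fields
    # near the mask boundary still see those zeros (the leakage discussed
    # below).
    perc = F.mse_loss(vgg_features(image * mask),
                      vgg_features(target * mask))
    return l1_weight * l1 + perc_weight * perc
```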

[Image grid: Sketch Image | z Latent | w Latent | w+ Latent]

I found that the w+ latents tended to let the projection overfit a bit too much. Even though certain regions of the image were masked out, the first few layers of the VGG-19 network used for the perceptual loss could still read those pixels and propagate their values to nearby unmasked pixels, causing the darker edges. Using the w latents instead helped a bit in getting more natural reconstructions, whereas the w+ latents overfit the input drawings, producing unnatural glowing-eyed results.

The projection performed here seems to produce the best results when the input data is sparse. Drawings like those in the third and fourth rows provided too much pixel information, which prevented the optimization from generating a more realistic cat image. The first two sketches, though, contain only sparse information, giving the generator far more room to create an image that still satisfies the constraints. Below are a few sketches I made myself, along with the results from the StyleGAN generator with w latents.

[Image grid: Sketch Image | Result]

Here, I wanted to test how the projection would respond to irregular grumpy cats. My first image tested a face with a happy expression, the second an upside-down cat, and the third a cat with an irregular head shape. Overall, the realistic images tended to lose their irregularities: the upside-down cat face still ended up right-side up, and the triangular face produced a cat with a round head and a grey stripe in its fur around where the edge should have been. These results aren't that surprising, though, considering that StyleGAN's output should only cover the manifold of realistic, upright cat images. The projection procedure only provides suggestions for what the pixel colors of the solution should be; moreover, the sketch only defines color, not gradients, so features like edges aren't easily preserved.