Assignment #5 - GAN Photo Editing

Rohan Rao (rgrao@andrew.cmu.edu)

An example of grumpy cat outputs generated from sketch inputs using the methods developed in this assignment.

Introduction

In this assignment, we look at using generative networks for user-guided image editing. We do this by first "inverting" a pre-trained generator: identifying the latent code that leads the generator to reconstruct a given real image. We then perturb or interpolate this latent code in latent space and observe how this lets us manipulate the generated images.

Part 1: Inverting the Generator

For the first part of the assignment, we solve an optimization problem to find the latent code whose generated image best reconstructs a given real image. Because the output must come from the trained generator, it is constrained to stay close to the natural image manifold. This results in the following non-convex optimization problem:

For some choice of loss $\mathcal{L}$, a trained generator $G$, and a given real image $x$, we can write

$$z^* = \arg\min_z \; \mathcal{L}(G(z), x).$$

For the loss function, we use a standard L2 loss combined with a perceptual loss, as defined in the paper "The Unreasonable Effectiveness of Deep Features as a Perceptual Metric", among others. Specifically, we compute the perceptual loss from the activations of several conv layers of a VGG-16 network. Since this is a non-convex optimization problem for which we can access gradients, we can attempt to solve it with any first-order or quasi-Newton optimization method. We use the full-batch L-BFGS optimizer with line search.
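As a rough illustration, here is a minimal sketch of the inversion loop, assuming a PyTorch generator `G`, a criterion combining the L2 and perceptual terms, and an initial noise sample; the function name, arguments, and defaults are illustrative rather than the exact assignment code.

```python
import torch

def invert(G, criterion, x_real, z_init, num_steps=1000):
    """Optimize a latent code z so that G(z) reconstructs x_real (sketch)."""
    z = z_init.clone().detach().requires_grad_(True)
    # Full-batch L-BFGS with strong Wolfe line search, as described above.
    optimizer = torch.optim.LBFGS([z], lr=1.0, line_search_fn="strong_wolfe")

    def closure():
        optimizer.zero_grad()
        loss = criterion(G(z), x_real)  # combined L2 + perceptual loss
        loss.backward()
        return loss

    for _ in range(num_steps):
        optimizer.step(closure)
    return z.detach()
```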

Implementation Details

  • The Criterion class implements the loss function and also accepts a mask for the sketch-to-image task. We use a weighted combination of the perceptual and L2 losses, with the perceptual weight varying between 0 and 1.
  • The sample_noise function for the vanilla GAN simply draws a random noise vector of the dimension the GAN expects. For StyleGAN, we sample noise, pass it through the model's mapping network, and stack or average the resulting codes depending on the from_mean condition.
  • Finally, the whole functionality is tied together in project, which runs the inversion: the criterion, noise sampling, and optimization are applied to each image in the data stream. A sketch of these pieces is given below.
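The sketch below shows how these pieces might fit together. The Criterion and sample_noise signatures, the convex weighting of the two losses, and the StyleGAN attributes (`mapping`, `num_ws`) are assumptions for illustration, not the exact assignment interface.

```python
import torch
import torch.nn as nn

class Criterion(nn.Module):
    """Weighted combination of L2 and perceptual losses (sketch)."""
    def __init__(self, perc_loss, perc_wgt=0.5, mask=None):
        super().__init__()
        self.perc_loss = perc_loss      # e.g. a VGG-16 feature-distance module
        self.perc_wgt = perc_wgt        # perceptual weight in [0, 1]
        self.mask = mask                # optional mask for the sketch-to-image task

    def forward(self, fake, real):
        if self.mask is not None:
            fake, real = fake * self.mask, real * self.mask
        l2 = torch.mean((fake - real) ** 2)
        return (1 - self.perc_wgt) * l2 + self.perc_wgt * self.perc_loss(fake, real)


def sample_noise(dim, device, model=None, latent="z", from_mean=False, n_samples=5000):
    """Draw a latent: raw z for the DCGAN, or w / w+ through StyleGAN's mapping network."""
    z = torch.randn(n_samples if from_mean else 1, dim, device=device)
    if latent == "z" or model is None:
        return z[:1]
    w = model.mapping(z)                  # hypothetical mapping-network call
    if from_mean:
        w = w.mean(dim=0, keepdim=True)   # average many samples to approximate the mean w
    if latent == "w+":
        w = w.unsqueeze(1).repeat(1, model.num_ws, 1)  # stack per-layer copies for w+
    return w
```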

Deliverables

The image above shows some samples from the latent space of the vanilla DCGAN network.

The image above shows some samples from the "w" latent space of the StyleGAN network.

The image above shows some samples from the "w+" latent space of the StyleGAN network.

Here are some results of the image reconstruction, each run for N=1000 iterations. The perceptual weight is varied from 0 to 1 (as noted in each caption), and a small perceptual weight usually helps maintain realism. The StyleGAN results generally look better than the vanilla GAN ones, probably due to the richer latent space and larger model. The best results come from StyleGAN in the w+ space with a close-to-zero perceptual weight, although these also take the longest to optimize.

The following images show the target image on the far left, followed by the state of the optimization at steps 250, 500, 750, and 1000.

The image above shows Vanilla GAN reconstruction with perceptual weight = 0.

The image above shows Vanilla GAN reconstruction with perceptual weight = 0.5.

The image above shows Vanilla GAN reconstruction with perceptual weight = 1.

The image above shows StyleGAN reconstruction with w space and with perceptual weight = 0.

The image above shows StyleGAN reconstruction with w space and with perceptual weight = 0.5.

The image above shows StyleGAN reconstruction with w space and with perceptual weight = 1.

The image above shows StyleGAN reconstruction with w+ space and with perceptual weight = 0.

The image above shows StyleGAN reconstruction with w+ space and with perceptual weight = 0.5.

Part 2: Interpolate your Cats

Now that we have a technique for inverting the cat images, we can do arithmetic with the latent vectors we have just found. One simple example is interpolating between images via a convex combination of their inverses. More precisely, given images $x_1$ and $x_2$, compute $z_1 = G^{-1}(x_1)$ and $z_2 = G^{-1}(x_2)$. Then, for some $\theta \in (0,1)$, we can combine the latent codes by $z' = \theta z_1 + (1-\theta) z_2$ and generate the result via $x' = G(z')$. Choose a discretization of $(0,1)$ to interpolate your image pair.
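A minimal sketch of this interpolation, assuming the two latent codes have already been obtained via the Part 1 inversion and `G` is a PyTorch generator; the function name and frame count are illustrative.

```python
import torch

def interpolate(G, z1, z2, num_steps=30):
    """Generate frames along the convex combination of two inverted latent codes."""
    frames = []
    for theta in torch.linspace(0.0, 1.0, num_steps):
        z = theta * z1 + (1 - theta) * z2   # convex combination in latent space
        with torch.no_grad():
            frames.append(G(z))             # decode back to image space
    return frames
```

Saving the frames as a GIF (forward and then reversed) produces the looping animations shown below.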

Deliverables

From the GIFs below, we can see that the interpolation can change the face pose, rotate the head, zoom in or out, change the color of the fur, and so on. Notably, all of the interpolated images remain on the natural image manifold, so the transitions look smooth and realistic.

Above sequence is for StyleGAN with the w+ latent space.

Above sequence is for StyleGAN with the w latent space.

Part 3: Scribble to Image [40 Points]

Next, we would like to constrain our image in some way while keeping it realistic. Here the constraints take the form of color scribbles, but many other kinds of constraints are possible. To generate an image subject to constraints, we solve a penalized non-convex optimization problem. We assume the constraints are of the form $f_i(x) = v_i$ for some scalar-valued functions $f_i$ and scalar values $v_i$.

Written in a form that includes our trained generator G, this soft-constrained optimization problem is

$$z^* = \arg\min_z \sum_i \| f_i(G(z)) - v_i \|^2.$$

Color Scribble Constraints: Given a user color scribble, we would like the GAN to fill in the details. Say we have a hand-drawn scribble image $s \in \mathbb{R}^d$ with a corresponding mask $m \in \{0,1\}^d$. Then, for each pixel in the mask, we can add a constraint that the corresponding pixel of the generated image must equal the scribble, i.e. $m_i x_i = m_i s_i$.

Since our color scribble constraints are all elementwise, we can reduce the above equation under our constraints to

$$z^* = \arg\min_z \; \| M \odot G(z) - M \odot S \|^2,$$

where $\odot$ is the Hadamard (elementwise) product, $M$ is the mask, and $S$ is the sketch.
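A minimal sketch of this masked objective, assuming `mask` and `sketch` are tensors broadcastable against the generator output; in practice this is essentially the Part 1 criterion evaluated with the mask passed in, so the same L-BFGS loop can be reused.

```python
import torch

def scribble_loss(G, z, sketch, mask):
    """Penalized objective ||M * G(z) - M * S||^2 for the color-scribble constraints."""
    diff = mask * G(z) - mask * sketch   # elementwise (Hadamard) products with the mask
    return torch.sum(diff ** 2)
```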

Implementation Details

Results

Note: All the images here are created by drawing a sketch in Sketchpad and saving it as an RGBA image; the alpha (transparency) channel serves as the mask that specifies which pixels are constrained, as described above.
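For reference, here is a sketch of how such an RGBA export could be loaded, with the alpha channel binarized into the constraint mask; the function name, resolution, and threshold are illustrative assumptions.

```python
import numpy as np
import torch
from PIL import Image

def load_sketch(path, size=64, device="cpu"):
    """Load an RGBA Sketchpad export; the alpha channel becomes the constraint mask."""
    img = Image.open(path).convert("RGBA").resize((size, size))
    arr = np.asarray(img, dtype=np.float32) / 255.0                      # H x W x 4 in [0, 1]
    t = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).to(device)   # 1 x 4 x H x W
    rgb, alpha = t[:, :3], t[:, 3:]
    mask = (alpha > 0.5).float()   # 1 where the user drew, 0 elsewhere
    return rgb, mask
```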

Above we have a relatively sparse sketch, since we have only provided the outline of the various features. As a result, the generated cat faces are quite varied and differ significantly from one another. I varied both the model type and the perceptual weight, and kept the best results. We can also see that, since the sketch includes cat ears, all the generated images have ears at the top of the image.

This sketch has even less detail around the whiskers and ears, so the model generates more varied results at different zoom levels: the first image is a very close-up view with whiskers, while the final image is relatively zoomed out without clearly visible whiskers.

Here we have provided a relatively strong and dense sketch, and as a result we get images that are more or less similar to each other, with slight zoom variations.

Once again, we have quite a dense sketch that also provides the cat's ears, so the generated images have ears and approximately the same pose across different optimization runs. We also don't explicitly draw whiskers, so they appear in some images and are less prominent in others.

In this final example, most of the details are provided in the sketch, including whiskers, so the resulting cats also have clearly visible whiskers. The pose is also relatively fixed due to the amount of information in the sketch constraint.

Further Resources