GAN Photo Editing

Carnegie Mellon University, 16-726 Learning-Based Image Synthesis, Spring 2021

Null Reaper Logo
Null Reaper (Clive Gomes)

Task

The goal of this project is to implement techniques to manipulate images on the manifold of natural images. Essentially, we would like to take a hand-drawn sketch of a cat and generate a real-looking cat image that satisfies the constraints given by the sketch.

Inverting the Generator

We already have pre-trained generator models (VanillaGAN and StyleGAN) which were implemented in previous projects. Accordingly, our first step in this project is to invert these generators so that we can obtain the latent code for an input image. We do this by sampling random noise (in the z/w/w+ space) and optimizing it until the image generated from the latent code reconstructs the input. This is represented by the following optimization problem:

Optimization Problem

Here, "G" is the generator model (VanillaGAN or StyleGAN), "z" is the random noise, "x" is the input image, "L" is the loss function and "z*" is the latent code for the input image. For the loss function L, we used the L-2 loss as well as Perceptual (Content) Loss.
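The inversion loop can be sketched in PyTorch as follows. This is a minimal illustration, not the actual course code: the generator interface `G(z)`, the latent dimension `z_dim`, and the optional `features` extractor for the perceptual loss are all placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def invert(G, x, z_dim=100, n_iters=500, lr=0.1, perc_weight=0.1, features=None):
    """Optimize a latent code z so that G(z) reconstructs the target image x.

    G        -- pre-trained generator (placeholder interface: G(z) -> image)
    x        -- target image tensor
    features -- optional feature extractor for the perceptual (content) loss
    """
    z = torch.randn(1, z_dim, requires_grad=True)   # random init in z-space
    opt = torch.optim.Adam([z], lr=lr)              # only z is optimized; G stays frozen
    for _ in range(n_iters):
        opt.zero_grad()
        x_hat = G(z)
        loss = F.mse_loss(x_hat, x)                 # L-2 reconstruction loss
        if features is not None:                    # add perceptual loss on deep features
            loss = loss + perc_weight * F.mse_loss(features(x_hat), features(x))
        loss.backward()
        opt.step()
    return z.detach()                               # z*: the latent code for x
```

The same loop works for the w/w+ spaces by initializing `z` with the appropriate shape and feeding it into the corresponding entry point of the generator.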

In the rest of this section, we present the outputs of the reconstruction process using different combinations of generator models (VanillaGAN/StyleGAN), latent spaces (z/w/w+), and losses (L-2 Loss/Perceptual Loss). The following cat image will be used as the input.

Cat Image
Figure 1: Input Cat Image for Reconstruction from Latent Code

VanillaGAN

If we use a VanillaGAN, we cannot use the w/w+ latent space. Accordingly, here are the outputs reconstructed from the z latent space using L-2 Loss alone vs. L-2 + Perceptual Loss.

Reconstructed Cat Image Reconstructed Cat Image
Figure 2: Reconstructed Cat Image from Z Latent Space w/o (left) vs w/ (right) Perceptual Loss

Both with and without Perceptual Loss, the outputs look very similar to the original image, but a bit blurry (VanillaGAN produces relatively low-fidelity samples). However, the output with Perceptual Loss has a more defined face structure (notice the jaw and nose in particular). This is expected, since Perceptual Loss helps preserve the structural characteristics of images.

StyleGAN

Now let's try using the StyleGAN model. First, we once again use L-2 Loss alone but reconstruct images from all three latent spaces (z, w & w+).

Reconstructed Cat Image Reconstructed Cat Image Reconstructed Cat Image
Figure 3: Reconstructed Cat Image from Z (left), W (middle) and W+ Latent Space (right) using L-2 Loss

Clearly the StyleGAN outputs are much better than the VanillaGAN ones, as StyleGAN's architecture is designed to produce higher-quality images. The w-space output has a much smoother color palette than the one reconstructed from the z-space. The w+ space output is also of good quality, but the eyes are very different from the other two images (the color palette is also slightly different).

Now let's add in the Perceptual Loss and repeat the process.

Reconstructed Cat Image Reconstructed Cat Image Reconstructed Cat Image
Figure 4: Reconstructed Cat Image from Z (left), W (middle) and W+ Latent Space (right) using L-2 + Perceptual Loss

In terms of quality, the outputs are pretty much the same as before. However, the cat faces are a bit more defined in these images (again, due to Perceptual Loss), though the effect is very subtle (notice the eyes).

Among all the output images in this section, the one reconstructed using the StyleGAN model, w latent space, and L-2 + Perceptual Loss was most similar to the original input image.

Interpolating Cats

Now that we have a way to obtain the latent code for cat images, we can perform algebraic operations on them. To keep things simple, we compute a linear combination of two cat images using the following equation:

Interpolation Equation

Here, z1 and z2 are the latent codes for the two input images obtained using the inversion process from the previous section, and theta is a value between 0 and 1 that controls how much of each image is mixed in. Once we have the output latent code z', we pass it through our generator to obtain the interpolated output.
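The interpolation itself is a one-liner per frame; sweeping theta from 0 to 1 produces the animation frames. A minimal sketch (the generator `G` is again a placeholder):

```python
import torch

def interpolate(G, z1, z2, thetas):
    """Decode a sequence of blended latents: z' = theta*z1 + (1-theta)*z2.

    Returns one generated image per value of theta, giving the frames of
    the interpolation from the second input (theta=0) to the first (theta=1).
    """
    return [G(theta * z1 + (1 - theta) * z2) for theta in thetas]
```

The same function applies unchanged in the w and w+ spaces, since the linear combination is taken before the code is decoded.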

As an example, let's use the following two cat images as the inputs.

Cat Image #1 Cat Image #2
Figure 5: Input Cat Images for Interpolation

Using L-2 + Perceptual Loss, below is the interpolation output for the z latent space code using the VanillaGAN model...

Interpolated Output
Figure 6: Interpolated Output from Z Latent Space using VanillaGAN

And here are all three latent space codes using the StyleGAN model...

Interpolated Output Interpolated Output Interpolated Output
Figure 7: Interpolated Output from Z (left), W (middle) and W+ Latent Space (right) using StyleGAN

As expected, the StyleGAN outputs are of higher quality than the VanillaGAN ones. Additionally, the VanillaGAN frames change drastically between steps, producing abrupt transitions. Among the StyleGAN outputs, the cat face in the z latent space interpolation seems to first shrink and then grow back to the size of the second input cat; the face also rotates during this process to match the orientation of the inputs. This shrinking and enlargement is not seen in the w and w+ latent space outputs; instead, the cat face warps smoothly between the two input images.

Scribble to Image

Finally, we implemented a function to convert a scribble of a cat into a realistic image. To constrain the output of the GAN using the scribble, we use the following optimization problem:

Scribble Optimization Problem

Here, S is the scribble and M is the mask (1 at scribble pixels, 0 at background pixels); X * Y denotes the Hadamard (element-wise) product of two images (or matrices) X and Y.
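This is the same optimization loop as the inversion step, except the loss is masked so only the scribbled pixels constrain the generator. A hedged sketch, with the generator interface and latent dimension again assumed rather than taken from the course code:

```python
import torch
import torch.nn.functional as F

def fit_scribble(G, S, M, z_dim=100, n_iters=500, lr=0.1):
    """Find z minimizing ||M * G(z) - M * S||^2, i.e. match G(z) to the
    scribble S only on the pixels selected by the mask M."""
    z = torch.randn(1, z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        # the Hadamard product restricts the loss to scribbled pixels;
        # background pixels are left free for the generator to fill in
        loss = F.mse_loss(M * G(z), M * S)
        loss.backward()
        opt.step()
    return G(z).detach()   # generated image satisfying the scribble constraints
```

Because the background pixels are unconstrained, the generator's prior over natural cat images fills in the rest of the face.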

To test the scribble to image function, we start by using a sketch provided in the project dataset.

Input Cat Sketch Output Cat Sketch
Figure 8: Input (left) vs Generated (right) Regular Cat Sketch

Since this seemed to work well, we tried creating custom scribbles (for simplicity, the earlier sketch was modified to create new ones). First, we compared wide vs narrow sketches.

Input Cat Sketch Output Cat Sketch
Figure 9: Input (left) vs Generated (right) Wide Cat Sketch
Input Cat Sketch Output Cat Sketch
Figure 10: Input (left) vs Generated (right) Narrow Cat Sketch

As is evident from the outputs, the reconstructed images fit the sketches: the output cat face is wide for the wide cat sketch and narrow for the narrow one. After this, we compared dense vs. sparse sketches.

Input Cat Sketch Output Cat Sketch
Figure 11: Input (left) vs Generated (right) Dense Cat Sketch
Input Cat Sketch Output Cat Sketch
Figure 12: Input (left) vs Generated (right) Sparse Cat Sketch

We noticed that dense sketches do not give very good outputs, most likely because they impose too many constraints on the generator. The sparse sketch, on the other hand, gave an extremely clear output owing to its limited constraints.

Finally, we tested out different color palettes.

Input Cat Sketch Output Cat Sketch
Figure 13: Input (left) vs Generated (right) Cat Sketch Color Palette #1
Input Cat Sketch Output Cat Sketch
Figure 14: Input (left) vs Generated (right) Cat Sketch Color Palette #2

Depending on the color palette used, the output cat images take on a tinge of those colors, as can be seen in the images above.