In this assignment, we implement several techniques for manipulating images on the manifold of natural images. First, we invert a pre-trained generator to find a latent variable that closely reconstructs a given real image. In the second part of the assignment, we take a hand-drawn sketch and generate an image that fits the sketch.
We solve an optimization problem to reconstruct the image from a particular latent code. Natural images lie on a low-dimensional manifold, and we treat the output manifold of a trained generator as a close approximation to it. This yields the following nonconvex optimization problem: for some choice of loss $\mathcal{L}$, a trained generator $G$, and a given real image $x$, $$z^{*}=\arg \min _{z} \mathcal{L}(G(z), x).$$ For the loss function, we use the $L_2$ loss together with combinations of perceptual (content) losses computed from VGG16 features.
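The optimization above can be sketched as a simple gradient-based loop. This is a minimal illustration, not the exact implementation: the toy `invert` function, its parameter names, and the pluggable `feat_extractor` argument are all assumptions. In practice the feature extractor would be a truncated pre-trained VGG16 (e.g. `torchvision.models.vgg16(...).features` up to an intermediate ReLU), and the generator would be the pre-trained GAN.

```python
import torch
import torch.nn.functional as F

def invert(generator, target, feat_extractor, perc_weight=0.5, steps=1000, lr=0.05):
    """Optimize a latent code z so that generator(z) reconstructs `target`.

    Hypothetical sketch: `generator` exposes a `z_dim` attribute, and
    `feat_extractor` maps images to perceptual (content) features.
    """
    z = torch.randn(1, generator.z_dim, requires_grad=True)  # latent code to optimize
    opt = torch.optim.Adam([z], lr=lr)  # Adam here; LBFGS is another common choice

    with torch.no_grad():
        target_feat = feat_extractor(target)  # features are fixed during optimization

    for _ in range(steps):
        opt.zero_grad()
        x = generator(z)
        l2 = F.mse_loss(x, target)                      # pixel-level reconstruction
        perc = F.mse_loss(feat_extractor(x), target_feat)  # content / perceptual term
        loss = (1 - perc_weight) * l2 + perc_weight * perc
        loss.backward()
        opt.step()
    return z.detach()
```

Sweeping `perc_weight` over {0.1, 0.3, 0.5, 0.7, 0.9} reproduces the column structure of the results table below.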
Reconstructions at iteration 0 and iteration 1000 for each model and latent space; columns give the perceptual loss weight.

| Model, latent space type | Iteration | 10% | 30% | 50% | 70% | 90% |
| --- | --- | --- | --- | --- | --- | --- |
| Vanilla GAN, Z space | iter 0 | | | | | |
| Vanilla GAN, Z space | iter 1000 | | | | | |
| StyleGAN, Z space | iter 0 | | | | | |
| StyleGAN, Z space | iter 1000 | | | | | |
| StyleGAN, W space | iter 0 | | | | | |
| StyleGAN, W space | iter 1000 | | | | | |
| StyleGAN, W+ space | iter 0 | | | | | |
| StyleGAN, W+ space | iter 1000 | | | | | |
Here we use a combination of the $L_2$ loss and the perceptual loss as our loss function. The $L_2$ loss constrains raw pixel values, pushing the reconstructed image toward the target in pixel color, while the perceptual loss constrains content, encouraging a similar cat structure and pose. Accordingly, in the images above, as the perceptual loss weight increases from left to right, the color values become less faithful, but the cat's figure and pose grow progressively more similar.
Comparing the results from the vanilla GAN and StyleGAN, the StyleGAN reconstructions are clearly better. The cats from the vanilla GAN are blurred and in places distorted, while those from StyleGAN are sharp and detailed. StyleGAN with a W or W+ latent vector outperforms StyleGAN with a Z latent vector, but it is hard to tell whether W or W+ is better. For the W space, StyleGAN with a 70% perceptual loss weight gives the best result (most accurate pose), taking 229.05 s to reconstruct the cat. For the W+ space, StyleGAN with a 10% perceptual loss weight gives the best result, taking 225.08 s.
Now that we have a technique for inverting cat images, we can do arithmetic with the latent vectors we have just found. One simple example is interpolating between images via a convex combination of their inverses. More precisely, given images $x_1$ and $x_2$, compute $z_1=G^{-1}(x_1)$ and $z_2=G^{-1}(x_2)$. Then, for some $\theta \in(0,1)$, combine the latent codes as $z^{\prime}=\theta z_{1}+(1-\theta) z_{2}$ and generate the image $x'=G(z')$. We choose a discretization of $(0,1)$ to interpolate our image pair.
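The interpolation step can be sketched as follows. This is a hypothetical helper, not the assignment's exact code: `interpolate`, `n_frames`, and the frame-list return type are all illustrative choices, and `z1`, `z2` are assumed to come from the inversion procedure described above.

```python
import torch

def interpolate(generator, z1, z2, n_frames=10):
    """Generate frames along the convex combination z' = theta*z1 + (1-theta)*z2.

    Sketch only: `generator` maps latent codes to images; the frames can be
    assembled into a gif with any image library.
    """
    frames = []
    for theta in torch.linspace(0.0, 1.0, n_frames):
        z = theta * z1 + (1 - theta) * z2  # convex combination of the two latents
        frames.append(generator(z))
    return frames
```

Note that with this convention, `theta = 0` reproduces the second image's latent and `theta = 1` the first's.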
As shown in the gif, the cats gradually transition from the source cat's appearance to the destination cat's. The generated transition frames look quite realistic. But interpolation quality also depends on the quality of the latent vectors (how well they reconstruct the original cats): the first row is an example of good performance, while in the second row the latent vector of the destination image cannot perfectly reconstruct the destination cat, so the transition does not end up matching it.
Next, we would like to constrain our image in some way while keeping it realistic. We initially tackle this with color scribble constraints, but many other constraints are possible as well.
Color Scribble Constraints: Given a user color scribble, we would like the GAN to fill in the details. Say we have a hand-drawn scribble image $s \in \mathbb{R}^{d}$ with a corresponding mask $m \in \{0,1\}^{d}$. Then, for each pixel in the mask, we add the constraint that the corresponding pixel in the generated image equal the sketch, i.e. $m_{i} x_{i}=m_{i} s_{i}$. Since our color scribble constraints are all elementwise, we can reduce them to the objective $$z^{*}=\arg \min _{z}\|M * G(z)-M * S\|^{2},$$ where $*$ is the Hadamard product, $M$ is the mask, and $S$ is the sketch. The results are shown below. (Since scribble-to-image generation involves a lot of randomness, we show only some good results here.)
| Scribble | | | | | |
| --- | --- | --- | --- | --- | --- |
| Generated Cat | | | | | |
From left to right, the scribble becomes denser, and the generated cats shift from a more realistic style to a more painting-like style (more blurred and pale). This matches the intuition that a denser scribble with more color adds more constraints on the latent vector, pulling it farther from the original cat manifold and producing cat images with highly saturated colors, like those in the scribbles.
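The masked objective $\|M * G(z)-M * S\|^{2}$ can be sketched as a small optimization loop. As before, this is an illustrative sketch, not the exact implementation: `invert_scribble` and its parameters are hypothetical names, and `s` and `m` are the scribble and binary mask as tensors shaped like the generator output.

```python
import torch
import torch.nn.functional as F

def invert_scribble(generator, s, m, steps=500, lr=0.05):
    """Find z such that generator(z) matches the scribble s on the masked pixels.

    Sketch only: `m` is 1 where the user drew and 0 elsewhere, so only the
    scribbled pixels constrain the generated image.
    """
    z = torch.randn(1, generator.z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        x = generator(z)
        # Hadamard-masked L2: || M * G(z) - M * S ||^2
        loss = F.mse_loss(m * x, m * s)
        loss.backward()
        opt.step()
    return z.detach()
```

Because the unmasked pixels are unconstrained, the generator is free to fill them in, which is exactly where the run-to-run randomness noted above comes from: different initial `z` values settle on different completions.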