16-726 Learning-Based Image Synthesis

Project 5: GAN Photo Editing

Chang Shi


Overview

In this assignment, we implement a few different techniques to manipulate images on the manifold of natural images. First, we invert a pre-trained generator to find a latent variable that closely reconstructs a given real image. In the later parts of the assignment, we interpolate between cat images in latent space and take a hand-drawn sketch and generate an image that fits the sketch.

Part 1: Inverting the Generator

We solve an optimization problem to reconstruct an image from a particular latent code. Natural images lie on a low-dimensional manifold, and we treat the output manifold of a trained generator as a close approximation of the natural image manifold. For some choice of loss $\mathcal{L}$, a trained generator $G$, and a given real image $x$, we can set up the following nonconvex optimization problem: $$z^{*}=\arg \min _{z} \mathcal{L}(G(z), x).$$ For the loss function, we use an $L_2$ loss combined with a perceptual (content) loss computed from VGG16 features.
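The structure of this inversion can be sketched with a toy example. Here a fixed linear map stands in for the trained generator, and hand-written gradient descent stands in for the autograd-based optimizer used in the actual project; only the shape of the loop is meant to carry over.

```python
import numpy as np

# Toy sketch of GAN inversion: G(z) = A @ z is a stand-in "generator",
# and we recover z* by gradient descent on the L2 loss ||G(z) - x||^2.
# In the project, G is a pre-trained (Style)GAN and autograd supplies
# the gradients; the loop structure is the same.

rng = np.random.default_rng(0)
A = rng.normal(size=(16, 4))      # stand-in generator weights
z_true = rng.normal(size=4)
x = A @ z_true                    # "real image" we want to reconstruct

z = np.zeros(4)                   # initial latent guess
lr = 0.01
for _ in range(2000):
    residual = A @ z - x          # G(z) - x
    grad = 2 * A.T @ residual     # gradient of ||G(z) - x||^2 w.r.t. z
    z -= lr * grad

loss = np.sum((A @ z - x) ** 2)
```

For this convex stand-in the loop drives the loss to (numerically) zero; with a real generator the problem is nonconvex, which is why initialization and the choice of latent space matter below.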


(Target image and reconstructions at iteration 0 and iteration 1000 omitted.) The table below reports the reconstruction time for each model and latent space at each perceptual loss weight:

Model, latent space   | 10%     | 30%     | 50%     | 70%     | 90%
----------------------|---------|---------|---------|---------|--------
Vanilla GAN, Z space  | 105.04s | 102.65s | 100.85s | 107.49s | 104.54s
StyleGAN, Z space     | 223.79s | 227.32s | 225.37s | 220.09s | 222.43s
StyleGAN, W space     | 226.63s | 226.22s | 226.03s | 229.05s | 221.30s
StyleGAN, W+ space    | 225.08s | 226.01s | 224.24s | 221.91s | 223.30s

Here we use a combination of $L_2$ loss and perceptual loss as our loss function. The $L_2$ loss constrains raw pixel values, pushing the reconstructed image toward the target in pixel color, while the perceptual loss constrains content, encouraging a similar cat structure and pose. Accordingly, in the images above, as the perceptual loss weight increases from left to right, the color values become less faithful but the cat's figure and pose gradually become more similar.
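The weighted combination can be sketched as follows. A fixed random projection stands in for the VGG16 feature extractor (in the project, intermediate conv activations of a pre-trained VGG16 would be used instead), so only the weighting scheme is illustrated here.

```python
import numpy as np

# Sketch of the combined loss: (1 - w) * pixel L2 + w * feature-space L2.
# F is a stand-in "feature extractor"; the project uses VGG16 activations.

rng = np.random.default_rng(1)
F = rng.normal(size=(8, 16))            # stand-in feature extractor

def combined_loss(gen, target, perc_weight):
    """Blend pixel-space and feature-space L2 by the perceptual weight."""
    l2 = np.mean((gen - target) ** 2)
    perc = np.mean((F @ gen - F @ target) ** 2)
    return (1 - perc_weight) * l2 + perc_weight * perc

x = rng.normal(size=16)
```

Sweeping `perc_weight` from 0.1 to 0.9 reproduces the 10%–90% columns of the table above: higher weights trade pixel-color fidelity for content similarity.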

Comparing results from the vanilla GAN and StyleGAN, the reconstruction from StyleGAN is clearly better. The cats from the vanilla GAN are blurred and some parts are even twisted, while the cats from StyleGAN are sharp and detailed. StyleGAN with a W/W+ space latent vector outperformed StyleGAN with a Z space latent vector, but it is hard to tell whether W or W+ is better. For the W space, StyleGAN with a 70% perceptual loss weight seems to give the best result (most accurate pose), taking 229.05s to reconstruct the cat; for the W+ space, StyleGAN with a 10% perceptual loss weight seems to give the best result, taking 225.08s.

Part 2: Interpolate your Cats

Now that we have a technique for inverting the cat images, we can do arithmetic with the latent vectors we have just found. One simple example is interpolating between images via a convex combination of their inverses. More precisely, given images $x_1$ and $x_2$, compute $z_1=G^{-1}(x_1)$ and $z_2=G^{-1}(x_2)$. We can then combine the latent codes for some $\theta \in(0,1)$ by $z^{\prime}=\theta z_{1}+(1-\theta) z_{2}$ and generate the result via $x'=G(z')$. We choose a discretization of $(0,1)$ to interpolate our image pair.
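The interpolation step itself is a one-liner once the codes are inverted; a minimal sketch, again with a linear map standing in for the trained generator:

```python
import numpy as np

# Latent interpolation sketch: given inverted codes z1, z2 (found as in
# Part 1), decode images along the convex combination between them.
# A stands in for the trained StyleGAN generator.

rng = np.random.default_rng(2)
A = rng.normal(size=(16, 4))            # stand-in generator
z1, z2 = rng.normal(size=4), rng.normal(size=4)

def interpolate(z1, z2, n_steps=10):
    """Decode x' = G(theta * z1 + (1 - theta) * z2) for a grid of theta,
    including the endpoints theta = 0 (giving z2) and theta = 1 (giving z1)."""
    thetas = np.linspace(0.0, 1.0, n_steps)
    return [A @ (t * z1 + (1 - t) * z2) for t in thetas]

frames = interpolate(z1, z2)
```

Stacking the decoded frames in order yields the transition GIFs shown below.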

Source Image
Reconstructed Src Image
Interpolation Process
Reconstructed Dst Image
Destination Image

As shown in the GIF images, the cats gradually transition from the source cat's appearance to the destination cat's appearance, and the generated transition frames look quite realistic. We can also conclude that interpolation quality depends on the quality of the latent vectors (how well they reconstruct the original cats). The first row is an example of good performance; in the second row, the latent vector of the destination image cannot perfectly reconstruct the destination cat, so the transition does not end up perfectly matching it.

Part 3: Scribble to Image

Next, we would like to constrain our image in some way while keeping it realistic. Here we initially tackle this problem with color scribble constraints, but many other constraints are possible as well.

Color Scribble Constraints: Given a user color scribble, we would like the GAN to fill in the details. Say we have a hand-drawn scribble image $s \in \mathbb{R}^{d}$ with a corresponding mask $m \in \{0,1\}^{d}$. Then for each pixel in the mask, we can add a constraint that the corresponding pixel in the generated image must equal the sketch, which might look like $m_{i} x_{i}=m_{i} s_{i}$. Since our color scribble constraints are all elementwise, we can reduce them to the optimization problem $$z^{*}=\arg \min _{z}\|M * G(z)-M * S\|^{2},$$ where $*$ is the Hadamard product, $M$ is the mask, and $S$ is the sketch. The results are shown below. (Since the scribble-to-image generation process involves a lot of randomness, we only show some good ones here.)
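The masked objective plugs into the same optimization loop as Part 1; a minimal sketch, with a fixed linear map again standing in for the trained generator and plain gradient descent standing in for the autograd-based optimizer:

```python
import numpy as np

# Scribble-constrained objective sketch: only the masked (scribbled) pixels
# contribute to the loss, via the Hadamard product with the binary mask M.

rng = np.random.default_rng(3)
A = rng.normal(size=(16, 4))        # stand-in generator: G(z) = A @ z
S = rng.normal(size=16)             # "scribble" image (only masked pixels matter)
M = np.zeros(16)
M[:5] = 1.0                         # binary mask: first 5 pixels are scribbled

def scribble_loss(z):
    """|| M * G(z) - M * S ||^2 -- reconstruction error on scribbled pixels."""
    return np.sum((M * (A @ z) - M * S) ** 2)

z = np.zeros(4)
init_loss = scribble_loss(z)
for _ in range(3000):
    residual = M * (A @ z - S)      # error on the scribbled pixels only
    z -= 0.01 * 2 * A.T @ residual  # gradient step on the masked loss
final_loss = scribble_loss(z)
```

Because unmasked pixels are unconstrained, the generator is free to hallucinate the rest of the cat there, which is exactly the behavior exploited in the results below.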

Scribble

Generated Cat

From left to right, the scribble becomes denser, and the generated cats transition from a more realistic style to a more painterly style (more blurred and pale). This aligns with the intuition that a denser scribble with more color adds more constraints on the latent vector, making it deviate further from the original cat manifold and leading to cat images with highly saturated colors, like those in the scribbles.