16-726 Learning-Based Image Synthesis, 2021 Spring

Project 5: GAN Photo Editing

Teddy Zhang (wentaiz)

Overview

In the past few years, the quality of images synthesized by GANs has increased rapidly. Compared to the seminal DCGAN framework from 2015, current state-of-the-art GANs can synthesize images at much higher resolution and with significantly greater realism. Among them, StyleGAN makes use of an intermediate latent space $W$ that holds the promise of enabling controlled image modifications. Image editing becomes more exciting when we can modify a given image rather than a randomly generated one, which leads to a natural question: can we embed a given photograph into the GAN latent space?

In this project, we will implement a few different techniques to manipulate images on the manifold of natural images. First, we will invert a pre-trained generator to find a latent variable that closely reconstructs the given real image. In the second part of the project, we will take a hand-drawn sketch and generate an image that fits the sketch accordingly.

Content Reconstruction

In this part, we optimize a latent code so that the loss between the generated image and a given target image is minimized. The process can be viewed as projecting the target image onto a low-dimensional manifold.
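Concretely, the projection can be set up as gradient-based optimization over the latent code. A minimal PyTorch sketch, where `G`, `z_dim`, and `perc_loss` are placeholders for the actual pre-trained generator, its latent dimensionality, and an optional perceptual loss (assumptions for illustration, not the exact code used here):

```python
import torch

def project(G, target, z_dim=100, steps=1000, lam=0.01, perc_loss=None):
    """Optimize a latent code z so that G(z) reconstructs `target`.

    G, z_dim, and perc_loss stand in for the pre-trained generator, its
    latent dimensionality, and an optional perceptual loss module.
    """
    z = torch.randn(1, z_dim, requires_grad=True)  # random initialization
    opt = torch.optim.Adam([z], lr=0.01)
    for _ in range(steps):
        opt.zero_grad()
        x = G(z)
        loss = torch.nn.functional.mse_loss(x, target)   # pixel (L2) loss
        if perc_loss is not None:
            loss = loss + lam * perc_loss(x, target)     # weighted perceptual term
        loss.backward()
        opt.step()
    return z.detach()
```

Because the objective is nonconvex, the result depends on the initialization of `z` and the number of steps.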

Here, we are going to explore the influence of three major components of the algorithm: the generator model (vanilla GAN vs. StyleGAN), the latent space ($z$, $w$, or $w+$), and the perceptual loss weight $\lambda$.

Some details of my implementation are:

We ran the optimization solver with the 12 setups described above (four model/latent-space combinations, each with three values of $\lambda$) using the first image from the grumpy cat collection. The resulting reconstructions and runtimes are shown below:

(Target and reconstruction images omitted; runtimes per setup below.)

| Model | λ = 0 | λ = 0.01 | λ = 0.1 |
| --- | --- | --- | --- |
| Vanilla GAN w/ z | 6.56 s | 6.46 s | 6.64 s |
| StyleGAN w/ z | 16.85 s | 17.10 s | 17.16 s |
| StyleGAN w/ w | 16.58 s | 16.79 s | 16.49 s |
| StyleGAN w/ w+ | 16.41 s | 16.64 s | 17.19 s |

First, we can see that the vanilla model fails to reconstruct the background color compared with the StyleGAN models, but its running time is much shorter due to its simpler generator architecture. Comparing the results column by column, we can tell that the introduction of perceptual loss improves the quality of detail reconstruction (such as the shape of the eyes). However, it also makes the nonconvex optimization problem harder to solve: when $\lambda$ is large, the resulting images are not as good within the same number of steps, and more iterations are needed to get a better result. Out of all the reconstructions, the StyleGAN model using the $w+$ latent space with $\lambda = 0.01$ provides the best-quality reconstruction. For a $64 \times 64$ image, it takes about 17 seconds to run.
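For reference, the perceptual term weighted by $\lambda$ compares fixed deep features of the two images rather than raw pixels. A hedged sketch (the backbone is not specified above; `features` would typically be a slice of a pretrained network such as torchvision's VGG19):

```python
import torch

class PerceptualLoss(torch.nn.Module):
    """L2 distance between deep feature maps of two images. `features`
    would typically be a slice of a fixed pretrained network (e.g. a few
    torchvision VGG19 layers); that choice is an assumption here."""
    def __init__(self, features):
        super().__init__()
        for p in features.parameters():
            p.requires_grad_(False)  # the feature extractor stays fixed
        self.features = features.eval()

    def forward(self, x, y):
        return torch.nn.functional.mse_loss(self.features(x), self.features(y))
```

Because distances are measured in feature space, this loss rewards matching textures and shapes rather than exact pixel values.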

Interpolate your Cats

In this part, we can do arithmetic with the latent vectors we have just found. Interpolating via a convex combination of the inverses of two images yields the intermediate images between them. More precisely, given images $x_1$ and $x_2$, we compute $z_1 = G^{-1}(x_1)$ and $z_2 = G^{-1}(x_2)$. Then for some $\theta \in (0, 1)$ we combine the latent codes as $z' = \theta z_1 + (1 - \theta) z_2$ and generate the interpolated image via $x' = G(z')$.
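The combination above can be sketched directly; `G`, `z1`, and `z2` are placeholders for the pre-trained generator and the two inverted latent codes:

```python
import torch

def interpolate(G, z1, z2, n=20):
    """Generate n frames between two inverted latent codes via the
    convex combination z' = theta*z1 + (1-theta)*z2, sweeping theta
    over [0, 1] (so the frames go from G(z2) to G(z1))."""
    frames = []
    for theta in torch.linspace(0.0, 1.0, n):
        z = theta * z1 + (1.0 - theta) * z2  # convex combination
        frames.append(G(z))
    return frames
```

The returned frames can then be written out as a GIF with any image library.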

We ran the optimization with 20 interpolated points in the latent space for the best model identified above, using 10,000 steps of iterative optimization. The results are shown as GIFs between the two given images:

| Index | Image 1 | Interpolation | Image 2 |
| --- | --- | --- | --- |
| 1 | | | |
| 2 | | | |
| 3 | | | |
| 4 | | | |
| 5 | | | |

We can conclude from the results above that the interpolations are in most cases realistic and smooth. In pair 2, the position of the nose moves up smoothly. However, unrealistic results are generated for pair 3, where the left eye gradually appears through the interpolation. The major issue for this pair is that the initial optimization result for image 1 is poor (the left eye is blurred).

Scribble to Image

In this part, we would like to constrain our image in some way while keeping it realistic. We initially tackle this problem with color scribble constraints.

Given a user color scribble, we would like the GAN to fill in the details. Say we have a hand-drawn scribble image $S \in \mathbb{R}^d$ with a corresponding mask $m \in \{0, 1\}^d$. Then for each pixel in the mask, we can add a constraint that the corresponding pixel in the generated image must equal the scribble, which looks like $m_i x_i = m_i s_i$.

Since our color scribble constraints are all elementwise, we can reduce them to the optimization problem:

$$z_* = \arg\min_z \| M * G(z) - M * S \|^2$$

where $*$ is the Hadamard product, $M$ is the mask, and $S$ is the scribble.
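The masked objective is a one-line loss in code; `G`, `z`, `S`, and `M` mirror the symbols above:

```python
import torch

def scribble_loss(G, z, S, M):
    """||M * G(z) - M * S||^2 with Hadamard products: only the pixels
    selected by the mask M contribute to the loss, so the generator is
    free to fill in everything outside the scribble."""
    diff = M * G(z) - M * S
    return (diff ** 2).sum()
```

This loss drops straight into the same optimization loop used for content reconstruction.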

We ran the optimization in the latent space for the best model identified above, using 10,000 steps of iterative optimization. The results are shown below:

| Index | Scribble | Mask | Result |
| --- | --- | --- | --- |
| 1 | | | |
| 2 | | | |
| 3 | | | |
| 4 | | | |
| 5 | | | |

From the table above, we can see that for denser scribbles like 1, 2, and 3, the quality of the reconstructions is polarized. When there are similar images in the training set (as in 1 and 2), the reconstruction looks realistic. However, when a dense scribble is distinct from the training set, the loss forces the latent variable to generate an image very close to the given scribble, which means the final optimum lies too far from the distribution of real images in the training set. When the scribble is sparse, the optimization problem is easier since there are fewer constraints, but that also means less guidance is provided to the network to fill in the details. So we see in case 4 that the pose of the cat in the reconstruction differs from that in the scribble even though the color constraints are mostly satisfied.

To prevent the optimized latent vector from drifting too far from the original distribution, we developed a better strategy, explained in Bells & Whistles.


Bells & Whistles: New constraints on the scribbles

In this part, we revise the algorithm in the previous part and try to make the resulting images more realistic.
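The revision is not spelled out in detail here; one common way to keep the latent near the real-image manifold is to add a prior-penalty term to the masked loss. A hypothetical sketch of one optimization step under that assumption (the penalty form and weight `reg` are illustrative, not necessarily the strategy actually used):

```python
import torch

def regularized_step(G, z, S, M, opt, reg=1e-3):
    """One optimizer step on the masked scribble loss plus an L2 penalty
    that discourages z from drifting far from the latent prior. The
    penalty form and weight `reg` are illustrative assumptions."""
    opt.zero_grad()
    loss = ((M * G(z) - M * S) ** 2).sum() + reg * (z ** 2).sum()
    loss.backward()
    opt.step()
    return loss.item()
```

The regularizer trades exact scribble matching for staying closer to latents the generator has actually learned to decode.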

Here are the results using this revised algorithm:

| Index | Scribble | Mask | Result |
| --- | --- | --- | --- |
| 1 | | | |
| 2 | | | |
| 3 | | | |
| 4 | | | |
| 5 | | | |

Acknowledgement