In the past few years, the quality of images synthesized by GANs has increased rapidly. Compared to the seminal DCGAN framework of 2015, current state-of-the-art GANs can synthesize images at much higher resolution and with significantly greater realism. Among them, StyleGAN makes use of an intermediate latent space W that holds the promise of enabling controlled image modifications. Image modification is more exciting when it becomes possible to modify a given image rather than a randomly generated one. This leads to the natural question of whether it is possible to embed a given photograph into the GAN latent space.
In this project, we will implement several techniques to manipulate images on the manifold of natural images. First, we will invert a pre-trained generator to find a latent variable that closely reconstructs a given real image. In the second part of the project, we will take a hand-drawn sketch and generate an image that fits the sketch.
In this part, we optimize a latent code, initialized with noise, so that its loss against a given target image is minimized. The process can be viewed as projecting the target image onto a low-dimensional manifold.
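The projection step can be sketched as follows. This is a minimal illustration, not the exact project code: it assumes a PyTorch generator `G` exposing a `z_dim` attribute, and uses a plain pixel-wise L2 loss with Adam (the project also experiments with perceptual loss and other optimizers).

```python
import torch

def project(G, target, num_steps=1000, lr=0.1):
    """Find a latent code z such that G(z) closely reconstructs `target`.

    `G` and `z_dim` are illustrative assumptions; the real project may use
    a different latent space (z, w, or w+) and a combined loss.
    """
    z = torch.randn(1, G.z_dim, requires_grad=True)  # random initialization
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(num_steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(G(z), target)  # pixel-wise L2
        loss.backward()
        opt.step()
    return z.detach()
```

In the full project, the same loop is reused with different latent spaces and loss weightings, which is what the 12 setups below vary.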
Here, we are going to explore the influence of three major components of the algorithm:
Some details of my implementation are:
We ran the optimization solver with the 12 different setups mentioned above, using the first image from the grumpy cat collection. The resulting reconstructions are shown below:
First, we can see that the vanilla model failed to reconstruct the background color compared with the StyleGAN models, though its running time is much shorter owing to the simpler generator architecture. Comparing the results column by column, we can tell that the introduction of perceptual loss improves the quality of detail reconstruction (such as the shape of the eyes). But it also makes the nonconvex optimization problem harder to solve: when the perceptual loss weight is large, the resulting images are not as good within the same number of steps, and more iterations are needed to obtain a better result. Out of all the reconstructions, the StyleGAN model using the w+ latent space with perceptual loss provides the best-quality reconstruction. For one image, it takes about 17 seconds to run.
In this part, we can do arithmetic with the latent vectors we have just found. Interpolating via a convex combination of the inverses of two images provides the intermediate images between them. More precisely, given images $x_1$ and $x_2$, compute their latent codes $z_1 = G^{-1}(x_1)$ and $z_2 = G^{-1}(x_2)$. Then we can combine the latent codes for some $\theta \in (0, 1)$ by $z' = \theta z_1 + (1 - \theta) z_2$ and generate the intermediate image via $x' = G(z')$.
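The interpolation itself is a short loop. A minimal sketch, assuming `z1` and `z2` are latent codes already obtained by inversion and `G` is any callable generator (names are illustrative):

```python
import torch

def interpolate(G, z1, z2, num_points=20):
    """Generate images along the convex combination z' = theta*z1 + (1-theta)*z2."""
    frames = []
    for theta in torch.linspace(0.0, 1.0, num_points):
        z = theta * z1 + (1.0 - theta) * z2  # convex combination of the codes
        frames.append(G(z))                  # x' = G(z')
    return frames
```

The returned frames can then be written out as a GIF to visualize the transition.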
We ran the optimization with 20 interpolated points in the latent space for the best model identified above, using 10000 steps of iterative optimization. The results are demonstrated as a GIF between the two given images of each pair:
We can conclude from the results above that the interpolations are in most cases realistic and smooth. In pair 2, the position of the nose moves up smoothly. However, unrealistic results are generated for pair 3, where the left eye is gradually generated through the interpolation. The major issue for this pair is that the initial optimization result for image 1 is poor (the left eye is blurred).
In this part, we would like to constrain our image in some way while keeping it realistic. As an initial approach, we use color scribble constraints.
Given a user color scribble, we would like the GAN to fill in the details. Say we have a hand-drawn scribble image $S$ with a corresponding binary mask $M$. Then for each pixel selected by the mask, we can add a constraint that the corresponding pixel in the generated image must be equal to the sketch, which might look like $G(z)_i = S_i$ for every pixel $i$ with $M_i = 1$.
Since our color scribble constraints are all elementwise, we can reduce the above constraints to the objective:

$$z^* = \arg\min_z \| M \odot G(z) - M \odot S \|^2$$

where $\odot$ is the Hadamard product, $M$ is the mask, and $S$ is the sketch.
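The masked objective translates directly into code. A minimal sketch, assuming image tensors of matching shape (the function name and signature are illustrative):

```python
import torch

def scribble_loss(G, z, sketch, mask):
    """Masked reconstruction loss || M * G(z) - M * S ||^2.

    Pixels are constrained only where the mask is 1; everywhere else the
    generator is free to fill in details.
    """
    diff = mask * G(z) - mask * sketch  # Hadamard products with the mask
    return (diff ** 2).sum()
```

This loss is then minimized over the latent code with the same optimization loop used for inversion.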
We ran the optimization in the latent space for the best model identified above. 10000 steps are used during the iterative optimization. The results are demonstrated below:
From the table demonstrated above, we can see that for denser scribbles like 1, 2, and 3, the quality of reconstructions is polarized. When there are similar images in the training set (like 1 and 2), the reconstruction looks realistic. However, when the dense scribble is distinct from the training set, the loss forces the latent variable to generate an image very similar to the given scribble. This means that the optimum found is too far from the distribution of the real images in the training set. When the scribble is sparse, the optimization problem is easier since there are fewer constraints. But that also means less guidance is provided to the network to fill in the details. So we can see that in case 4, the pose of the cat in the reconstruction is different from that in the scribble, although the color constraints are mostly satisfied.
To avoid the optimized latent vector drifting too far from the original distribution, we developed a better strategy, explained in Bells & Whistles.
In this part, we revise the algorithm from the previous part to make the resulting images more realistic.
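One common way to keep the optimized latent close to the natural-image manifold is to add a regularizer that penalizes deviation from a reference latent (e.g. the mean latent). This sketch is an assumption about what such a revision could look like, not the project's exact method; `z_mean` and `alpha` are hypothetical names:

```python
import torch

def regularized_objective(G, z, sketch, mask, z_mean, alpha=1e-3):
    """Masked reconstruction loss plus a latent regularizer (illustrative).

    `z_mean` and the penalty form are assumptions: the term alpha*||z - z_mean||^2
    discourages the optimizer from wandering far from the latent distribution.
    """
    recon = ((mask * G(z) - mask * sketch) ** 2).sum()
    reg = alpha * ((z - z_mean) ** 2).sum()  # keeps z near the reference latent
    return recon + reg
```

Larger values of `alpha` trade scribble fidelity for realism, which matches the failure mode observed with dense out-of-distribution scribbles.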
Here are the results using this revised algorithm: