**CMU 16-726 Learning-based Image Synthesis** **Assignment #5** *Title: "GAN Photo Editing"* *Name: Soyong Shin (soyongs@andrew.cmu.edu)*

(##) Contents

* Part 1: Inverting the Generator
* Part 2: Interpolate your Cats
* Part 3: Scribble to Image
* Part 4: Bells and Whistles

(##) Part 1: Inverting the Generator

**1.1 Overview**

![figure [sample_editing]: Input](report/Figure1.png) ![**Video 1**: Output](report/gif1.gif)

In this part, I implemented an algorithm that reconstructs a given real cat image; Figure 1 shows an example of its output. Unlike our previous image-reconstruction assignment, in which we optimized each pixel of the image directly, here we optimize in the latent space $\mathcal{z}$ of a pretrained generator $\mathcal{G}$. Letting the real cat image be $\mathcal{v}$, the optimization objective is: $$ \mathcal{z}^{*} = \arg\min_{\mathcal{z}} \mathcal{L}(\mathcal{G}(\mathcal{z}), \mathcal{v}) $$ ![figure [architecture_stylegan]: Architecture of StyleGAN network](report/Figure2.png) I also analyzed the choice of latent space, comparing $\mathcal{z}$, $\mathcal{w}$, and $\mathcal{w+}$, where $\mathcal{z}$ is normal noise drawn from a random number generator. In StyleGAN, this random noise first passes through fully-connected mapping layers that produce the latent vector $\mathcal{w}$. The reason for these fully-connected layers is that the normal noise $\mathcal{z}$ is entangled, which makes it hard to control specific style features, so the network maps it to a new, disentangled latent space $\mathcal{w}$. The vector $\mathcal{w}$ is then fed into the different stages of the convolutional neural network (CNN) through the AdaIN normalization. The difference between $\mathcal{w}$ and $\mathcal{w+}$ is that with $\mathcal{w}$ we feed the same latent vector into all 12 stages of the CNN, while $\mathcal{w+}$ consists of 12 different vectors, one for each stage. The choice, therefore, is which of these latent spaces to optimize.
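To make the inversion concrete, below is a minimal sketch of the optimization loop, assuming a frozen pretrained generator `G` and a target image tensor `target`; the function name, hyperparameters, and the plain-MSE placeholder loss are illustrative rather than the actual course starter code (the full loss is described in Section 1.2).

```python
import torch
import torch.nn.functional as F

def invert(G, target, num_steps=1000, lr=1e-2, latent_dim=128):
    """Optimize a latent vector z so that G(z) reconstructs `target`.

    Only z is updated; the pretrained generator G stays frozen.
    """
    z = torch.randn(1, latent_dim, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(num_steps):
        optimizer.zero_grad()
        loss = F.mse_loss(G(z), target)  # placeholder for the full loss L
        loss.backward()                  # gradients flow into z, not into G
        optimizer.step()
    return z.detach()
```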

------------------------------------------------------------------------------------------------------------------------------------------------------------ **1.2 Loss Function** For calculating the loss, I implemented two loss functions and built the total loss as a linear combination of the two. The first is MSELoss $L_{mse}$, the mean squared error between the two images: $$ L_{mse} (\mathcal{z}; \mathcal{G}, \mathcal{v}) = \frac{1}{WH} \sum_{i\in W} \sum_{j\in H} (\mathcal{G}(\mathcal{z})_{i, j} - \mathcal{v}_{i, j})^2 $$ I further used another loss, PerceptualLoss $L_{perc}$, which is identical to the content loss we used for style transfer (Assignment 4). For transferring style between two images, we computed the content loss at a deeper convolutional layer, since the highly detailed features of the content image were not something we wanted to preserve. For this algorithm, however, I chose a shallow layer, since the purpose of this loss is precisely to capture detailed features. I used the first convolutional block (denoted $f^\phi(\cdot)$) of a VGG-19 network pretrained on the ImageNet dataset. $$ L_{perc} (\mathcal{z}; \mathcal{G}, \mathcal{v}) = \frac{1}{WH} \sum_{i\in W} \sum_{j\in H} (f^\phi (\mathcal{G}(\mathcal{z}))_{i, j} - f^\phi (\mathcal{v})_{i, j})^2 $$
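A sketch of the combined loss is below, assuming "first convolutional block" means the VGG-19 features up through relu1_2; the slice index, class name, and default weight are illustrative, and my actual implementation may cut the network slightly differently.

```python
import torch.nn as nn
import torchvision.models as models

class CombinedLoss(nn.Module):
    """L = L_mse + w_perc * L_perc, with the perceptual term computed on
    the first convolutional block of an ImageNet-pretrained VGG-19."""
    def __init__(self, w_perc=0.1):
        super().__init__()
        vgg = models.vgg19(pretrained=True).features.eval()
        # First conv block: conv1_1, relu1_1, conv1_2, relu1_2.
        self.feat = nn.Sequential(*list(vgg.children())[:4])
        for p in self.feat.parameters():
            p.requires_grad = False  # the feature extractor stays frozen
        self.w_perc = w_perc
        self.mse = nn.MSELoss()

    def forward(self, pred, target):
        return self.mse(pred, target) + \
               self.w_perc * self.mse(self.feat(pred), self.feat(target))
```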

------------------------------------------------------------------------------------------------------------------------------------------------------------ **1.3 Results** *Compare two GANs* ![figure [result_latent]: Comparison of different Generators](report/Figure3.png) Figure 3 illustrates the qualitative difference between images reconstructed with VanillaGAN and with StyleGAN. Although the gap in quality is not large, StyleGAN appears to reconstruct the shape of the cat face better. For this comparison, I used a perceptual loss weight of $w_{perc} = 0.1$.

*Compare latent spaces* ![figure [result_latent]: Comparison of different latent spaces](report/Figure4.png) Figure 4 shows a qualitative comparison of images reconstructed using the different types of latent space. For this, I used the 128x128 StyleGAN network and a perceptual loss weight of $w_{perc} = 0.1$. Indeed, there is no significant difference among the three spaces.

*Compare loss types* ![figure [result_latent]: Comparison of loss combination](report/Figure5.png) Figure 5 shows how the reconstructed results vary with the linear combination of the two losses, sweeping the perceptual loss weight. I fixed the other conditions to the 128x128 StyleGAN and the normal-noise latent space $\mathcal{z}$. From this qualitative comparison, using the perceptual loss clearly increases the quality, and the weight $w_{perc} = 0.1$ seems best.

*Other results* ![figure [other_results]: Other results for Inverting Generator](report/Figure6.png) For this experiment, I used StyleGAN with the latent space $\mathcal{w+}$ and $w_{perc} = 0.1$.

(##) Part 2: Interpolate your Cats **2.1 Overview** ![figure [sample_interpolation]: Input 1](report/Figure7.png) ![**Video 1**: Output](report/gif2.gif) ![figure [sample_interpolation]: Input 2](report/Figure8.png) In this part, I implemented interpolation between two images. One basic approach is to linearly interpolate each pixel value of the two images, which is definitely not a good idea and I will not try it at all: the dimension of the image space is much higher than that of the manifold of realistic cat faces, so, in other words, a pixel-wise average of two cat images will generally not be a cat. However, the trained generator $\mathcal{G}$, which has learned the distribution of cat faces, maps the latent space onto this manifold. We assume the latent space represents the distribution of realistic images; thus, if we pick a reasonable value (not an outlier too far from the latent distribution), the generator $\mathcal{G}$ will produce a realistic image. Say we have two input images $\mathcal{v}_1, \mathcal{v}_2$. Following the algorithm of Part 1, we can obtain the optimized latent vectors $\mathcal{z}_1 = \mathcal{G}^{-1}(\mathcal{v}_1)$ and $\mathcal{z}_2 = \mathcal{G}^{-1}(\mathcal{v}_2)$. We can then interpolate the images by interpolating the latent vectors, $\mathcal{z}' = w \mathcal{z}_1 + (1-w) \mathcal{z}_2$, and the interpolated image is $\mathcal{v}' = \mathcal{G}(\mathcal{z}')$, as sketched below.
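A minimal sketch of this interpolation, assuming `G` is the pretrained generator and `z1`, `z2` are the latent vectors recovered by the Part 1 inversion loop; the function name and frame count are illustrative.

```python
import torch

def interpolate(G, z1, z2, num_frames=30):
    """Render frames along the line between two inverted latent vectors."""
    frames = []
    with torch.no_grad():
        for w in torch.linspace(0.0, 1.0, num_frames):
            z = w * z1 + (1.0 - w) * z2  # z' = w*z1 + (1-w)*z2
            frames.append(G(z))          # v' = G(z')
    return frames
```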

------------------------------------------------------------------------------------------------------------------------------------------------------------ **2.2 Results** Here I show several examples. Note that I used the 128x128 StyleGAN with latent vector $\mathcal{w+}$ and perceptual weight $w_{perc} = 0.1$. ![figure [sample_interpolation]: Input 1](report/Figure9.png) ![**Video 1**: Output](report/gif3.gif) ![figure [sample_interpolation]: Input 2](report/Figure10.png) ![figure [sample_interpolation]: Input 1](report/Figure11.png) ![**Video 1**: Output](report/gif4.gif) ![figure [sample_interpolation]: Input 2](report/Figure12.png)

(##) Part 3: Scribble to Image **3.1 Overview** ![figure [sample_editing]: Sample result of photo editing](report/Figure13.png) Next, I implemented an algorithm that edits a photo under a given constraint such as a sketch or color scribble. Here, I illustrate the constraint with scribbles. The method is largely the same as the generator inversion of Part 1, in that we optimize over the latent space. Unlike Part 1, however, we are not reconstructing an image by comparing it against a real one; we synthesize an entirely new image that satisfies the given constraints. As shown in Figure 13, we have the color scribble (left) $\mathcal{s}$ and its mask (middle) $\mathcal{m}$. The goal of the algorithm is to find the optimized latent vector $\mathcal{z}$. The setup is similar to Part 1, with the one difference that we now have the mask $\mathcal{m}$. Therefore, the loss is computed using the Hadamard product ($*$): $$ \mathcal{z}^{*} = \arg\min_{\mathcal{z}} \mathcal{L}(\mathcal{m} * \mathcal{G}(\mathcal{z}), \mathcal{m} * \mathcal{s}) $$
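Below is a minimal sketch of this masked loss, assuming `scribble` and `mask` are image-shaped tensors with the mask equal to 1 on scribbled pixels and 0 elsewhere; the function name is illustrative.

```python
import torch.nn.functional as F

def scribble_loss(G, z, scribble, mask):
    """Masked reconstruction loss L(m * G(z), m * s).

    The Hadamard product with `mask` zeroes out all unscribbled pixels,
    so only user-constrained pixels contribute to the loss.
    """
    return F.mse_loss(mask * G(z), mask * scribble)
```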

------------------------------------------------------------------------------------------------------------------------------------------------------------ **3.2 Use of latent regularization** ![figure [sample_editing]: Effectiveness of latent vector regularization](report/Figure14.png) One problem I encountered in this task is that the output image is sometimes unrealistic. This happens because the latent vector is unregularized and can drift to values far from the latent distribution during optimization. Therefore, I added a regularization term that directly constrains the latent vector $\mathcal{z}$ with an L2 norm penalty. Figure 14 shows the effectiveness of this term. I admit neither result is great; however, the regularized version shows a significant improvement.
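A sketch of the regularized objective, building on the masked loss above; the penalty weight `lambda_reg` is an illustrative value, not the exact one used for Figure 14.

```python
import torch.nn.functional as F

def regularized_loss(G, z, scribble, mask, lambda_reg=0.01):
    """Masked reconstruction term plus an L2 penalty on z.

    The penalty keeps z near the origin, i.e. near the high-density
    region of the Gaussian prior, which helps keep G(z) realistic.
    """
    recon = F.mse_loss(mask * G(z), mask * scribble)
    return recon + lambda_reg * (z ** 2).sum()
```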

------------------------------------------------------------------------------------------------------------------------------------------------------------ **3.3 Results** ![figure [sample_editing]: Results of photo editing](report/Figure15.png) Here I show some results of the color scribble-to-image algorithm. It produces reasonable results in some cases (rows 1 and 2) but very poor results in others (rows 3 and 4). Even with the regularization term, the result is not always realistic. (##) Part 4: Bells and Whistles **4.1 Higher resolution** Here I compare results at different resolutions (64x64, 128x128, 256x256). The target image is the original image at 256x256 resolution. ![figure [sample_editing]: Results with higher-resolution images](report/Figure16.png)

------------------------------------------------------------------------------------------------------------------------------------------------------------ **4.2 Effect of discriminator** ![figure [sample_editing]: Effect of using a pretrained discriminator](report/Figure17.png) In a previous assignment (Assignment 3), we trained a discriminator $\mathcal{D}$ that discriminates grumpyB cats. Since the data here is of the same species, I used that pretrained discriminator to penalize unrealistic latent vectors during the optimization loop. Figure 17 shows a qualitative comparison of using the discriminator versus simple L2 regularization in the latent space. Although this does not solve the problem completely, it significantly improves the quality of the result. Using a patch discriminator might help as well; that is something to try next.
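A sketch of this discriminator-based penalty, again building on the masked loss above; the weight `lambda_d` and the non-saturating form of the realism term are assumptions for illustration, and the Assignment 3 discriminator may output probabilities rather than logits.

```python
import torch.nn.functional as F

def loss_with_discriminator(G, D, z, scribble, mask, lambda_d=0.01):
    """Masked reconstruction loss plus a realism penalty from a frozen,
    pretrained discriminator D (the grumpyB discriminator here)."""
    fake = G(z)
    recon = F.mse_loss(mask * fake, mask * scribble)
    # Non-saturating GAN term: push D's logit on G(z) toward "real".
    realism = F.softplus(-D(fake)).mean()
    return recon + lambda_d * realism
```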