*CMU 16-726 Spring 2021 Assignment #5*
Name: Juyong Kim
![](images/interpolate/6.png width=128px) ![](images/interpolate/7_stylegan_w+_l2_0.1.gif width=128px) ![](images/interpolate/7.png width=128px) ![](images/blank.png) ![](images/draw/3.png width=128px) ![](images/draw/3_stylegan_w+_l1_0.1_0.01_1500.png width=128px)
(#) Introduction
In this assignment, we implement a framework based on GANs (generative adversarial networks) for manipulating natural images.
We first find a latent vector that a GAN generator maps to a given input image, then interpolate between images in the latent vector space, and finally manipulate latent vectors under image-space constraints.
(#) Methods
This section describes the methods for manipulating the latent vectors, or latent codes, of the GAN generator.
The generator $G$ is a function parameterized by a neural network that maps a latent vector $z$ to an image $x = G(z)$.
In this assignment, we use the vanilla GAN and StyleGAN [#Karras19] architectures.
In StyleGAN, the input latent vector $z$ is transformed into an intermediate latent vector $w$ and then this vector is used to generate the image.
![Figure [stylegan]: Architecture of GAN and StyleGAN. In StyleGAN, the input latent vector $z$ is transformed into the intermediate vector $w$, which is then fed into the synthesis network.](images/stylegan.png width=400px)
(##) Part 1: Inverting the Generator
A trained generator can be considered to represent the low-dimensional manifold of the distribution of real images.
Given an input image $x$, we can "project" the image onto this manifold, i.e. find the latent vector that generates the input image:
$$ z^* = \arg \min_z \mathcal{L}(G(z), x) $$
Here, $\mathcal{L}$ is a loss function; we experimented with several classes of loss functions and their combinations.
For inverting StyleGAN, we can instead optimize the intermediate latent vector $w$.
The intermediate latent vector can be constrained to be the same for all AdaIN modules (which we call $w$) or allowed to differ across modules (which we call $w+$).
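The difference between the $w$ and $w+$ settings amounts to the shape of the variable being optimized. Below is a minimal sketch of the two parameterizations, with a stand-in for the mapping network; the dimensions and module definitions are assumptions for illustration, not the assignment's actual API.

```python
import torch
import torch.nn as nn

z_dim, n_style_layers = 512, 8            # assumed dimensions for illustration

# Stand-in for StyleGAN's mapping network (the real one is a deeper MLP).
mapping_network = nn.Sequential(nn.Linear(z_dim, z_dim), nn.LeakyReLU(0.2),
                                nn.Linear(z_dim, z_dim))

z = torch.randn(1, z_dim)
w = mapping_network(z)                                     # (1, z_dim)

# "w" setting: one style vector shared by every AdaIN module.
w_shared = w.unsqueeze(1).repeat(1, n_style_layers, 1)     # (1, n_layers, z_dim)

# "w+" setting: one independently optimized style vector per layer.
w_plus = w_shared.clone().detach().requires_grad_(True)    # (1, n_layers, z_dim)
```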
(###) Implementation Details
* The loss function $\mathcal{L}$ is implemented in the `Criterion` class in the code. This class covers the $L_1$ loss, the $L_2$ loss, a perceptual loss computed at the first layer (`conv1_1`) of the VGG-19 model [#Simonyan14], and combinations of an $L_p$ loss and the perceptual loss. The general form of the loss used in this part is:
$$ \mathcal{L}(x, x') = \| x - x' \|_p^p + \lambda_\text{perc} \mathcal{L}_\text{content}(x, x') $$
* The initialization of the latent vector is performed in the `sample_noise` method. The input latent vector $z$ is initialized with either the zero vector or a random Normal vector. The intermediate latent vector $w$ is initialized with the output of the mapping network, whose input is either the zero vector or a random Normal vector.
* The optimization over the latent vector is performed with the LBFGS method in the `optimize_para` function; only the loss computation needed to be filled in. A minimal sketch of the criterion and this optimization loop is given after this list.
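To make the setup above concrete, here is a minimal sketch of the combined criterion and the LBFGS projection loop. It is a simplified illustration rather than the assignment's actual `Criterion`/`optimize_para` code; the VGG layer indexing, the omission of input normalization, the function names, and the optimizer settings are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class Criterion(nn.Module):
    """Sketch of the combined loss: L_p reconstruction + perceptual term on VGG-19 conv1_1."""
    def __init__(self, p=2, lambda_perc=0.1):
        super().__init__()
        self.p, self.lambda_perc = p, lambda_perc
        if lambda_perc > 0:
            # conv1_1 + ReLU of VGG-19 (input normalization omitted for brevity).
            vgg = models.vgg19(pretrained=True).features
            self.conv1_1 = nn.Sequential(*list(vgg.children())[:2]).eval()
            for param in self.conv1_1.parameters():
                param.requires_grad_(False)

    def forward(self, pred, target):
        loss = (pred - target).abs().pow(self.p).mean()           # L_p term
        if self.lambda_perc > 0:
            loss = loss + self.lambda_perc * F.mse_loss(
                self.conv1_1(pred), self.conv1_1(target))          # perceptual term
        return loss

def project_latent(generator, target, latent_init, criterion, n_steps=1500):
    """Optimize a latent vector with LBFGS so that generator(latent) matches target."""
    latent = latent_init.clone().detach().requires_grad_(True)
    optimizer = torch.optim.LBFGS([latent], max_iter=n_steps)

    def closure():
        optimizer.zero_grad()
        loss = criterion(generator(latent), target)
        loss.backward()
        return loss

    optimizer.step(closure)   # LBFGS runs up to n_steps iterations internally
    return latent.detach()
```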
(##) Part 2: Interpolate your Cats
After completing the projection part, we can manipulate the latent vectors of real images.
One way of doing this is to interpolate between the latent vectors of real images.
Given two real images $x_1$ and $x_2$, we can compute their latent vectors $z_1 = G^{-1}(x_1)$ and $z_2 = G^{-1}(x_2)$.
Then we can generate an image **between the two images** by decoding an interpolation of the latent vectors, $G(\theta z_1 + (1-\theta) z_2)$, where $\theta \in [0, 1]$.
The resulting images are more realistic than those produced by element-wise interpolation in pixel space.
(###) Implementation Details
* Only the `interpolate` method, which interpolates the latent vectors, needs to be implemented (a minimal sketch follows this list).
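A minimal sketch of the interpolation step, assuming the latent vectors have already been obtained by projecting the two real images; the function name and signature are illustrative, not the assignment's exact `interpolate` method.

```python
import torch

def interpolate_latents(generator, z1, z2, n_frames=16):
    """Decode a sequence of convex combinations of two projected latent vectors."""
    frames = []
    for theta in torch.linspace(0.0, 1.0, n_frames):
        z = theta * z1 + (1.0 - theta) * z2       # interpolate in latent space
        with torch.no_grad():
            frames.append(generator(z))            # decode back to image space
    return torch.cat(frames, dim=0)                # (n_frames, C, H, W)
```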
(##) Part 3: Scribble to Image
The optimization that finds the latent vector of $G$ can also be performed under constraints.
The constraint of interest here is a scribble: we want the output to have a similar color wherever the scribble is drawn.
Given a color sketch $S$ and the mask $M$ of the sketch, we can optimize the latent vector to minimize the difference with the scribble:
$$ z^* = \arg \min_z \| M * G(z) - M * S \|_p^p + \lambda_\text{perc} \mathcal{L}_\text{content}(G(z), S, M) + \frac{\lambda_\text{reg}}{2} \|z\|_2^2, $$
where $*$ is the Hadamard (element-wise) product, and $\mathcal{L}_\text{content}$ is redefined to apply the mask in the VGG feature space.
Here, we introduce a regularization term on the latent vector, since the resulting vector might otherwise deviate too far from the original latent space.
(###) Implementation Details
* The `Criterion` class is modified to compute the loss with a mask, and the `optimize_para` function is also updated to add the regularization loss.
* The main function, `draw`, works essentially the same as `project`, except that the mask is used in the loss function (a minimal sketch of the masked objective follows this list).
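A minimal sketch of the masked objective above; `vgg_features` stands for a feature extractor such as the `conv1_1` sub-network from Part 1, and all names and default values are assumptions rather than the assignment's actual code.

```python
import torch
import torch.nn.functional as F

def masked_scribble_loss(pred, scribble, mask, latent,
                         p=1, lambda_perc=0.1, lambda_reg=0.01, vgg_features=None):
    """Masked L_p loss + masked perceptual loss + L2 regularization on the latent."""
    # Masked pixel-space reconstruction: only penalize pixels covered by the scribble.
    loss = (mask * (pred - scribble)).abs().pow(p).mean()

    if vgg_features is not None:
        f_pred, f_scribble = vgg_features(pred), vgg_features(scribble)
        # Use a single-channel mask resized to the feature resolution so the
        # perceptual term is also masked and broadcasts over feature channels.
        f_mask = F.interpolate(mask[:, :1], size=f_pred.shape[-2:], mode='nearest')
        loss = loss + lambda_perc * (f_mask * (f_pred - f_scribble)).pow(2).mean()

    # Keep the latent vector from drifting too far from the original latent space.
    loss = loss + 0.5 * lambda_reg * latent.pow(2).sum()
    return loss
```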
(#) Results
This section presents the experimental results of each part, including experiments tuning the hyper-parameters introduced in the previous section.
All experiments are performed at an image size of $64\times 64$, the size of the provided dataset and model weights.
For optimization, we ran LBFGS for 1500 iterations, which we consider sufficient for convergence.
(##) Part 0: Sampling Grumpy Cats
To see the overall image quality of each GAN, we generated images by feeding the random latent vectors into the generators.
Figure [forward-vanilla] and Figure [forward-stylegan] show some generated grumpy cat images from the vanilla GAN and StyleGAN.
Overall, StyleGAN generates cat images of better quality.
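For reference, sampling amounts to drawing standard Normal latent vectors and running a forward pass of the pre-trained generator; a minimal sketch, where the latent dimensionality and the generator call signature are assumptions:

```python
import torch

def sample_images(generator, n_samples=16, z_dim=100):
    """Draw standard Normal latent vectors and decode them with a pre-trained generator."""
    z = torch.randn(n_samples, z_dim)
    with torch.no_grad():
        return generator(z)    # e.g. (n_samples, 3, 64, 64) grumpy-cat samples
```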
![Figure [forward-vanilla]: Vanilla GAN](images/forward/vanilla_sample.png)
![Figure [forward-stylegan]: StyleGAN](images/forward/stylegan_sample_z.png)
(##) Part 1: Inverting the Generator
Figure [proj-vanilla-z-l2-0.0] ~ Figure [proj-stylegan-wplus-l1-10.0] show the projection results for an input image (Figure [proj-input]) with various models (vanilla GAN / StyleGAN), choices of $L_p$ loss, and values of the hyper-parameter $\lambda_\text{perc}$.
As seen in the subsection above, the overall quality is better with StyleGAN than with the vanilla GAN, regardless of the type of latent vector being optimized.
For the StyleGAN optimization, the details of the image, such as the green corner, are reconstructed better when we optimize the intermediate vector directly and the vectors for all layers are optimized separately ($w+$ rather than $w$).
However, there is a risk of overfitting, i.e. the resulting images leaving the real-image manifold and the face components becoming blurred.
![Figure [proj-input]: Real cat image](images/project/0_data.png width=100px)
![Figure [proj-vanilla-z-l2-0.0]: V/$z$/$L_2$/0.0](images/project/0_vanilla_z_l2_0_1500.png) ![Figure [proj-vanilla-z-l2-0.1]: V/$z$/$L_2$/0.1](images/project/0_vanilla_z_l2_0.1_1500.png) ![Figure [proj-vanilla-z-l2-1.0]: V/$z$/$L_2$/1.0](images/project/0_vanilla_z_l2_1_1500.png) ![Figure [proj-vanilla-z-l2-10.0]: V/$z$/$L_2$/10.0](images/project/0_vanilla_z_l2_10_1500.png) ![Figure [proj-vanilla-z-l1-0.0]: V/$z$/$L_1$/0.0](images/project/0_vanilla_z_l1_0_1500.png) ![Figure [proj-vanilla-z-l1-0.1]: V/$z$/$L_1$/0.1](images/project/0_vanilla_z_l1_0.1_1500.png) ![Figure [proj-vanilla-z-l1-1.0]: V/$z$/$L_1$/1.0](images/project/0_vanilla_z_l1_1_1500.png) ![Figure [proj-vanilla-z-l1-10.0]: V/$z$/$L_1$/10.0](images/project/0_vanilla_z_l1_10_1500.png)
![Figure [proj-stylegan-z-l2-0.0]: S/$z$/$L_2$/0.0](images/project/0_stylegan_z_l2_0_1500.png) ![Figure [proj-stylegan-z-l2-0.1]: S/$z$/$L_2$/0.1](images/project/0_stylegan_z_l2_0.1_1500.png) ![Figure [proj-stylegan-z-l2-1.0]: S/$z$/$L_2$/1.0](images/project/0_stylegan_z_l2_1_1500.png) ![Figure [proj-stylegan-z-l2-10.0]: S/$z$/$L_2$/10.0](images/project/0_stylegan_z_l2_10_1500.png) ![Figure [proj-stylegan-z-l1-0.0]: S/$z$/$L_1$/0.0](images/project/0_stylegan_z_l1_0_1500.png) ![Figure [proj-stylegan-z-l1-0.1]: S/$z$/$L_1$/0.1](images/project/0_stylegan_z_l1_0.1_1500.png) ![Figure [proj-stylegan-z-l1-1.0]: S/$z$/$L_1$/1.0](images/project/0_stylegan_z_l1_1_1500.png) ![Figure [proj-stylegan-z-l1-10.0]: S/$z$/$L_1$/10.0](images/project/0_stylegan_z_l1_10_1500.png)
![Figure [proj-stylegan-w-l2-0.0]: S/$w$/$L_2$/0.0](images/project/0_stylegan_w_l2_0_1500.png) ![Figure [proj-stylegan-w-l2-0.1]: S/$w$/$L_2$/0.1](images/project/0_stylegan_w_l2_0.1_1500.png) ![Figure [proj-stylegan-w-l2-1.0]: S/$w$/$L_2$/1.0](images/project/0_stylegan_w_l2_1_1500.png) ![Figure [proj-stylegan-w-l2-10.0]: S/$w$/$L_2$/10.0](images/project/0_stylegan_w_l2_10_1500.png) ![Figure [proj-stylegan-w-l1-0.0]: S/$w$/$L_1$/0.0](images/project/0_stylegan_w_l1_0_1500.png) ![Figure [proj-stylegan-w-l1-0.1]: S/$w$/$L_1$/0.1](images/project/0_stylegan_w_l1_0.1_1500.png) ![Figure [proj-stylegan-w-l1-1.0]: S/$w$/$L_1$/1.0](images/project/0_stylegan_w_l1_1_1500.png) ![Figure [proj-stylegan-w-l1-10.0]: S/$w$/$L_1$/10.0](images/project/0_stylegan_w_l1_10_1500.png)
![Figure [proj-stylegan-wplus-l2-0.0]: S/$w+$/$L_2$/0.0](images/project/0_stylegan_w+_l2_0_1500.png) ![Figure [proj-stylegan-wplus-l2-0.1]: S/$w+$/$L_2$/0.1](images/project/0_stylegan_w+_l2_0.1_1500.png) ![Figure [proj-stylegan-wplus-l2-1.0]: S/$w+$/$L_2$/1.0](images/project/0_stylegan_w+_l2_1_1500.png) ![Figure [proj-stylegan-wplus-l2-10.0]: S/$w+$/$L_2$/10.0](images/project/0_stylegan_w+_l2_10_1500.png) ![Figure [proj-stylegan-wplus-l1-0.0]: S/$w+$/$L_1$/0.0](images/project/0_stylegan_w+_l1_0_1500.png) ![Figure [proj-stylegan-wplus-l1-0.1]: S/$w+$/$L_1$/0.1](images/project/0_stylegan_w+_l1_0.1_1500.png) ![Figure [proj-stylegan-wplus-l1-1.0]: S/$w+$/$L_1$/1.0](images/project/0_stylegan_w+_l1_1_1500.png) ![Figure [proj-stylegan-wplus-l1-10.0]: S/$w+$/$L_1$/10.0](images/project/0_stylegan_w+_l1_10_1500.png)
Overall, the best projection result is obtained when we optimize $\mathbf{w+}$ of **StyleGAN** with the $\mathbf{L_2}$ loss and $\mathbf{\lambda_\text{perc} = 0.1}$.
Figure [proj-0-input] ~ Figure [proj-3-stylegan-z-l2-0.1] show the projection results for several grumpy cat images with this configuration.
![Figure [proj-0-input]: Real image](images/project/0_data.png width=100px) ![Figure [proj-1-input]: Real image](images/project/1_data.png width=100px) ![Figure [proj-2-input]: Real image](images/project/2_data.png width=100px) ![Figure [proj-3-input]: Real image](images/project/3_data.png width=100px)
![Figure [proj-0-stylegan-z-l2-0.1]: S/$w+$/$L_2$/0.1](images/project/0_stylegan_w+_l2_0.1_1500.png width=100px) ![Figure [proj-1-stylegan-z-l2-0.1]: S/$w+$/$L_2$/0.1](images/project/1_stylegan_w+_l2_0.1_1500.png width=100px) ![Figure [proj-2-stylegan-z-l2-0.1]: S/$w+$/$L_2$/0.1](images/project/2_stylegan_w+_l2_0.1_1500.png width=100px) ![Figure [proj-3-stylegan-z-l2-0.1]: S/$w+$/$L_2$/0.1](images/project/3_stylegan_w+_l2_0.1_1500.png width=100px)
(##) Part 2: Interpolate your Cats
With the configuration obtained above, we performed interpolation in the latent vector space.
Again, we compute the intermediate latent vector ($w+$) of StyleGAN with $L_2$ loss and $\lambda_\text{perc}=0.1$.
We can see that the interpolation results reconstruct the input images reasonably well and that the interpolation naturally transforms one image into another.
![Figure [interp-0-real]: Real image 0](images/interpolate/0.png width=100px) ![Figure [interp-0-proj]: Projection 0](images/interpolate/0_stylegan_w+_l2_0.1.png width=100px) ![Figure [interp-0-1]: Interpolation](images/interpolate/1_stylegan_w+_l2_0.1.gif width=100px) ![Figure [interp-1-proj]: Projection 1](images/interpolate/1_stylegan_w+_l2_0.1.png width=100px) ![Figure [interp-1-real]: Real image 1](images/interpolate/1.png width=100px)
![Figure [interp-2-real]: Real image 2](images/interpolate/2.png width=100px) ![Figure [interp-2-proj]: Projection 2](images/interpolate/2_stylegan_w+_l2_0.1.png width=100px) ![Figure [interp-2-3]: Interpolation](images/interpolate/3_stylegan_w+_l2_0.1.gif width=100px) ![Figure [interp-3-proj]: Projection 3](images/interpolate/3_stylegan_w+_l2_0.1.png width=100px) ![Figure [interp-3-real]: Real image 3](images/interpolate/3.png width=100px)
![Figure [interp-4-real]: Real image 4](images/interpolate/4.png width=100px) ![Figure [interp-4-proj]: Projection 4](images/interpolate/4_stylegan_w+_l1_0.1.png width=100px) ![Figure [interp-4-5]: Interpolation](images/interpolate/5_stylegan_w+_l1_0.1.gif width=100px) ![Figure [interp-5-proj]: Projection 5](images/interpolate/5_stylegan_w+_l1_0.1.png width=100px) ![Figure [interp-5-real]: Real image 5](images/interpolate/5.png width=100px)
![Figure [interp-6-real]: Real image 6](images/interpolate/6.png width=100px) ![Figure [interp-6-proj]: Projection 6](images/interpolate/6_stylegan_w+_l1_0.1.png width=100px) ![Figure [interp-6-7]: Interpolation](images/interpolate/7_stylegan_w+_l1_0.1.gif width=100px) ![Figure [interp-7-proj]: Projection 7](images/interpolate/7_stylegan_w+_l1_0.1.png width=100px) ![Figure [interp-7-real]: Real image 7](images/interpolate/7.png width=100px)
![Figure [interp-8-real]: Real image 8](images/interpolate/8.png width=100px) ![Figure [interp-8-proj]: Projection 8](images/interpolate/8_stylegan_w+_l1_0.1.png width=100px) ![Figure [interp-8-9]: Interpolation](images/interpolate/9_stylegan_w+_l1_0.1.gif width=100px) ![Figure [interp-9-proj]: Projection 9](images/interpolate/9_stylegan_w+_l1_0.1.png width=100px) ![Figure [interp-9-real]: Real image 9](images/interpolate/9.png width=100px)
![Figure [interp-10-real]: Real image 10](images/interpolate/10.png width=100px) ![Figure [interp-10-proj]: Projection 10](images/interpolate/10_stylegan_w+_l1_0.1.png width=100px) ![Figure [interp-10-11]: Interpolation](images/interpolate/11_stylegan_w+_l1_0.1.gif width=100px) ![Figure [interp-11-proj]: Projection 11](images/interpolate/11_stylegan_w+_l1_0.1.png width=100px) ![Figure [interp-11-real]: Real image 11](images/interpolate/11.png width=100px)
(##) Part 3: Scribble to Image
We optimized the latent vector under the color scribble constraints given in the assignment.
On top of the best configuration obtained above, we tried different strengths of the regularization hyper-parameter $\lambda_\text{reg}$.
![Figure [draw-0-input]: Scribble 0](images/draw/0.png width=100px) ![Figure [draw-0-0.0]: $\lambda_\text{reg}=0.0$](images/draw/0_stylegan_w+_l1_0.1_0_1500.png width=100px) ![Figure [draw-0-0.01]: $\lambda_\text{reg}=0.01$](images/draw/0_stylegan_w+_l1_0.1_0.01_1500.png width=100px) ![Figure [draw-0-0.03]: $\lambda_\text{reg}=0.03$](images/draw/0_stylegan_w+_l1_0.1_0.03_1500.png width=100px) ![Figure [draw-0-0.1]: $\lambda_\text{reg}=0.1$](images/draw/0_stylegan_w+_l1_0.1_0.1_1500.png width=100px) ![Figure [draw-0-0.3]: $\lambda_\text{reg}=0.3$](images/draw/0_stylegan_w+_l1_0.1_0.3_1500.png width=100px)
![Figure [draw-1-input]: Scribble 1](images/draw/1.png width=100px) ![Figure [draw-1-0.0]: $\lambda_\text{reg}=0.0$](images/draw/1_stylegan_w+_l1_0.1_0_1500.png width=100px) ![Figure [draw-1-0.01]: $\lambda_\text{reg}=0.01$](images/draw/1_stylegan_w+_l1_0.1_0.01_1500.png width=100px) ![Figure [draw-1-0.03]: $\lambda_\text{reg}=0.03$](images/draw/1_stylegan_w+_l1_0.1_0.03_1500.png width=100px) ![Figure [draw-1-0.1]: $\lambda_\text{reg}=0.1$](images/draw/1_stylegan_w+_l1_0.1_0.1_1500.png width=100px) ![Figure [draw-1-0.3]: $\lambda_\text{reg}=0.3$](images/draw/1_stylegan_w+_l1_0.1_0.3_1500.png width=100px)
![Figure [draw-2-input]: Scribble 2](images/draw/2.png width=100px) ![Figure [draw-2-0.0]: $\lambda_\text{reg}=0.0$](images/draw/2_stylegan_w+_l1_0.1_0_1500.png width=100px) ![Figure [draw-2-0.01]: $\lambda_\text{reg}=0.01$](images/draw/2_stylegan_w+_l1_0.1_0.01_1500.png width=100px) ![Figure [draw-2-0.03]: $\lambda_\text{reg}=0.03$](images/draw/2_stylegan_w+_l1_0.1_0.03_1500.png width=100px) ![Figure [draw-2-0.1]: $\lambda_\text{reg}=0.1$](images/draw/2_stylegan_w+_l1_0.1_0.1_1500.png width=100px) ![Figure [draw-2-0.3]: $\lambda_\text{reg}=0.3$](images/draw/2_stylegan_w+_l1_0.1_0.3_1500.png width=100px)
![Figure [draw-3-input]: Scribble 3](images/draw/3.png width=100px) ![Figure [draw-3-0.0]: $\lambda_\text{reg}=0.0$](images/draw/3_stylegan_w+_l1_0.1_0_1500.png width=100px) ![Figure [draw-3-0.01]: $\lambda_\text{reg}=0.01$](images/draw/3_stylegan_w+_l1_0.1_0.01_1500.png width=100px) ![Figure [draw-3-0.03]: $\lambda_\text{reg}=0.03$](images/draw/3_stylegan_w+_l1_0.1_0.03_1500.png width=100px) ![Figure [draw-3-0.1]: $\lambda_\text{reg}=0.1$](images/draw/3_stylegan_w+_l1_0.1_0.1_1500.png width=100px) ![Figure [draw-3-0.3]: $\lambda_\text{reg}=0.3$](images/draw/3_stylegan_w+_l1_0.1_0.3_1500.png width=100px)
![Figure [draw-4-input]: Scribble 4](images/draw/4.png width=100px) ![Figure [draw-4-0.0]: $\lambda_\text{reg}=0.0$](images/draw/4_stylegan_w+_l1_0.1_0_1500.png width=100px) ![Figure [draw-4-0.01]: $\lambda_\text{reg}=0.01$](images/draw/4_stylegan_w+_l1_0.1_0.01_1500.png width=100px) ![Figure [draw-4-0.03]: $\lambda_\text{reg}=0.03$](images/draw/4_stylegan_w+_l1_0.1_0.03_1500.png width=100px) ![Figure [draw-4-0.1]: $\lambda_\text{reg}=0.1$](images/draw/4_stylegan_w+_l1_0.1_0.1_1500.png width=100px) ![Figure [draw-4-0.3]: $\lambda_\text{reg}=0.3$](images/draw/4_stylegan_w+_l1_0.1_0.3_1500.png width=100px)
Figure [draw-0-input] ~ Figure [draw-4-0.3] show the input scribbles and the output results.
For all inputs, we can find an output image which is realistic and corresponds to the scribble.
If $\lambda_\text{reg}$ is too small, the outputs do not look realistic and adhere too closely to the input scribbles.
On the other hand, if $\lambda_\text{reg}$ is too large, the outputs look realistic but deviate from the scribbles.
One interesting observation is that the strength of $\lambda_\text{reg}$ needed to make the output both realistic and similar to the input differs across types of scribbles.
We conjecture that when the scribble is small (e.g. when the stroke is thin), the constraint imposed by the loss function is weak, so less regularization is needed.
(#) Bibliography
[#Karras19]: Karras, Tero, Samuli Laine, and Timo Aila. 2019. "A Style-Based Generator Architecture for Generative Adversarial Networks." In _Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR '19)_. https://arxiv.org/abs/1812.04948
[#Simonyan14]: Simonyan, Karen, and Andrew Zisserman. 2015. "Very Deep Convolutional Networks for Large-Scale Image Recognition." In _3rd International Conference on Learning Representations (ICLR '15)_. https://arxiv.org/abs/1409.1556