**Neural Style Transfer**

*CMU 16-726 Spring 2021 Assignment #4*

Name: Juyong Kim

![](images/inputs/content/phipps.jpeg height=200px) ![](images/inputs/style/starry_night.jpeg height=200px) ![](images/content_best/starry_night_phipps_stytrans2.jpg height=200px)

(#) Introduction

In this assignment, we implement neural style transfer, which renders the content of one image in the style of another image using a pretrained CNN (convolutional neural network). For example, we can generate cat images in Ukiyo-e style. Following the first attempts at neural style transfer [#Gatys15a, #Gatys15b], we optimize an image that minimizes a combination of a content loss and a style loss, both of which are defined in terms of CNN feature maps. Unlike those papers, however, we normalize each feature channel by its mean and variance, which makes the optimization of the combined loss stable and easy to configure.

The assignment has three parts. In the first part, to get a sense of gradient descent on an image, we implement content reconstruction by optimizing an image with respect to the content loss, defined as the $\ell_2$ distance between feature maps. In the second part, we perform texture synthesis by optimizing the style loss, defined as the $\ell_2$ distance between the Gram matrices of feature maps. Combining both loss terms, in the last part we finally perform neural style transfer.

**Click images to see larger**

(#) Part 1: Content Reconstruction

At different layers of the CNN, we obtain intermediate feature maps of different sizes. Given a content image $C$ and a convolutional neural network, we extract the feature maps at each layer $l$, denoted $F_C^l \in \mathbb{R}^{N_l \times M_l}$, where $N_l$ is the number of channels at layer $l$ and $M_l$ is the number of pixels per channel (each feature map channel is flattened into a 1-D vector). The image $X$ being optimized can also be fed through the network to produce feature maps $F_X^l$, and the (normalized) squared difference between the two feature maps is what we call *the content loss*:
$$ \mathcal{L}_{\text{content},l}(X, C) = \frac{1}{2N_lM_l} \| \overline{F}_X^l - \overline{F}_C^l \|_2^2, $$
where $\overline{F}_X^l$ and $\overline{F}_C^l$ are the feature maps normalized by the per-channel mean and standard deviation of $F_C^l$. The content loss can be combined over multiple layers to make the output match the content at multiple levels. Since the two feature maps must have the same size, the two images must also be the same size.

(#) Part 2: Texture Synthesis

Given a style image $S$ and the input image $X$, the Gram matrix $G_S^l \in \mathbb{R}^{N_l \times N_l}$ of the feature maps at layer $l$ is computed as the inner products of all pairs of feature map channels:
$$ G_S^l = \frac{1}{M_l} \overline{F}_S^l \overline{F}_S^{l\top}, ~~~ G_X^l = \frac{1}{M_l} \overline{F}_X^l \overline{F}_X^{l\top}, $$
where $\overline{F}_S^l$ and $\overline{F}_X^l$ are the feature maps normalized in the same way as in the previous part. Each element of the Gram matrix is the (normalized) inner product of a pair of channels at layer $l$. *The style loss* at layer $l$ is defined as the (normalized) squared difference between the Gram matrices of the input image and the style image:
$$ \mathcal{L}_{\text{style},l} (X, S) = \frac{1}{2N_l^2} \| G_X^l - G_S^l \|_2^2. $$
As with the content loss, the style loss can be combined over multiple layers to make the output match the style at multiple levels.
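For concreteness, below is a minimal PyTorch sketch of the two losses as defined above. The function names and tensor shapes (each feature map is treated as an $(N_l, M_l)$ matrix of channels by flattened pixels) are illustrative assumptions, as is the choice to normalize both feature maps by the per-channel statistics of the target image; the actual implementation follows the assignment starter code.

```python
import torch

def normalize_features(feat, ref):
    """Normalize `feat` by the per-channel mean/std of the reference map `ref`.

    Both tensors have shape (N_l, M_l): channels x flattened pixels.
    """
    mean = ref.mean(dim=1, keepdim=True)
    std = ref.std(dim=1, keepdim=True) + 1e-8  # avoid division by zero
    return (feat - mean) / std

def content_loss(F_X, F_C):
    """1 / (2 N_l M_l) * || normalized(F_X) - normalized(F_C) ||_2^2."""
    N_l, M_l = F_C.shape
    diff = normalize_features(F_X, F_C) - normalize_features(F_C, F_C)
    return (diff ** 2).sum() / (2 * N_l * M_l)

def gram_matrix(F_norm):
    """G = 1 / M_l * F F^T, an (N_l x N_l) matrix of channel inner products."""
    N_l, M_l = F_norm.shape
    return F_norm @ F_norm.t() / M_l

def style_loss(F_X, F_S):
    """1 / (2 N_l^2) * || G_X - G_S ||_2^2."""
    N_l = F_S.shape[0]
    G_X = gram_matrix(normalize_features(F_X, F_S))
    G_S = gram_matrix(normalize_features(F_S, F_S))
    return ((G_X - G_S) ** 2).sum() / (2 * N_l ** 2)
```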
Unlike the content loss, the style image and the input image do not have to be the same size, since the Gram matrix depends only on the number of channels.

(#) Part 3: Neural Style Transfer

Combining the content loss and the style loss, we can perform neural style transfer, making the output image represent the content of one image in the style of another. Let $L_\text{content}$ and $L_\text{style}$ be the sets of layers used for the content losses and the style losses, respectively:
$$ \mathcal{L}(X, C, S) = \sum_{l \in L_\text{content}} \lambda_{\text{content},l} \mathcal{L}_{\text{content},l}(X, C) + \sum_{l \in L_\text{style}} \lambda_{\text{style},l} \mathcal{L}_{\text{style},l} (X, S), $$
where $\lambda_{\text{content},l}$ and $\lambda_{\text{style},l}$ are the coefficients of the individual losses. The relative ratio of the coefficients is important, since the larger loss dominates and determines which aspect of the target images the output resembles.

(#) Implementation Details

The implementation heavily depends on the code provided with [the assignment](https://learning-image-synthesis.github.io/assignments/hw4), which is based on the [PyTorch tutorial on neural style transfer](https://pytorch.org/tutorials/advanced/neural_style_tutorial.html). As the pretrained CNN used to extract features and optimize the input image, we use the VGG-19 model [#Simonyan14] imported from `torchvision`. In all experiments, we used the L-BFGS optimizer as suggested in the original paper [#Gatys15a], and we fixed the number of steps to 500 and the learning rate to 1.0.
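To show how the losses, the pretrained VGG-19, and the L-BFGS optimizer fit together, here is a rough sketch of the optimization loop. It reuses the `content_loss` and `style_loss` helpers from the sketch in Part 2; the layer-to-index mapping, the helper names, and the default loss weights are assumptions for illustration rather than the exact structure of the assignment code.

```python
import torch
from torchvision import models

# Indices into VGG-19's `features` where the losses are attached. The mapping
# of conv{X}_1 names to these indices is an assumption for illustration.
CONTENT_LAYERS = {"conv4_1": 19}
STYLE_LAYERS = {"conv1_1": 0, "conv2_1": 5, "conv3_1": 10, "conv4_1": 19}

vgg = models.vgg19(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def extract_features(img, layers):
    """Collect flattened (N_l, M_l) feature maps at the requested layer indices."""
    wanted = set(layers.values())
    feats, x = {}, img
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in wanted:
            feats[i] = x.squeeze(0).flatten(start_dim=1)  # (channels, pixels)
        if i >= max(wanted):
            break
    return feats

def style_transfer(content_img, style_img, input_img,
                   lambda_content=1.0, lambda_style=1e2, num_steps=500):
    """Optimize `input_img` with L-BFGS (lr=1.0, 500 steps) on the combined loss.

    Images are assumed to be (1, 3, H, W) tensors already preprocessed the way
    the VGG-19 features expect (e.g., ImageNet-normalized).
    """
    F_C = extract_features(content_img, CONTENT_LAYERS)
    F_S = extract_features(style_img, STYLE_LAYERS)
    input_img = input_img.detach().clone().requires_grad_(True)
    optimizer = torch.optim.LBFGS([input_img], lr=1.0)

    step = [0]
    def closure():
        optimizer.zero_grad()
        F_Xc = extract_features(input_img, CONTENT_LAYERS)
        F_Xs = extract_features(input_img, STYLE_LAYERS)
        loss = sum(lambda_content * content_loss(F_Xc[i], F_C[i]) for i in F_C)
        loss = loss + sum(lambda_style * style_loss(F_Xs[i], F_S[i]) for i in F_S)
        loss.backward()
        step[0] += 1
        return loss

    while step[0] < num_steps:  # count closure evaluations as optimization steps
        optimizer.step(closure)
    return input_img.detach()
```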
(#) Results

(##) Part 1: Content Reconstruction

This section describes the results of content reconstruction, which optimizes the content loss only. For each experiment, we imposed the content loss at a different stage of the VGG-19 layers (`conv1_1`~`conv5_1`).

![Figure [content-recon1-input]: `dancing`](images/inputs/content/dancing.jpg width=200px) ![Figure [content-recon1-layer1]: `conv1_1`](images/recon1/frida_kahlo_dancing_imgrecon.jpg width=200px) ![Figure [content-recon1-layer2]: `conv2_1`](images/recon2/frida_kahlo_dancing_imgrecon.jpg width=200px) ![Figure [content-recon1-layer3]: `conv3_1`](images/recon3/frida_kahlo_dancing_imgrecon.jpg width=200px) ![Figure [content-recon1-layer4]: `conv4_1`](images/recon4/frida_kahlo_dancing_imgrecon.jpg width=200px) ![Figure [content-recon1-layer5]: `conv5_1`](images/recon5/frida_kahlo_dancing_imgrecon.jpg width=200px)
![Figure [content-recon2-input]: `fallingwater`](images/inputs/content/fallingwater.png width=200px) ![Figure [content-recon2-layer1]: `conv1_1`](images/recon1/frida_kahlo_fallingwater_imgrecon.jpg width=200px) ![Figure [content-recon2-layer2]: `conv2_1`](images/recon2/frida_kahlo_fallingwater_imgrecon.jpg width=200px) ![Figure [content-recon2-layer3]: `conv3_1`](images/recon3/frida_kahlo_fallingwater_imgrecon.jpg width=200px) ![Figure [content-recon2-layer4]: `conv4_1`](images/recon4/frida_kahlo_fallingwater_imgrecon.jpg width=200px) ![Figure [content-recon2-layer5]: `conv5_1`](images/recon5/frida_kahlo_fallingwater_imgrecon.jpg width=200px)

Figure [content-recon1-input]~Figure [content-recon2-layer5] show the input content images and the output images produced from the content losses. Minimizing the content loss at the lowest layer (`conv1_1`) reconstructs the input image seemingly perfectly. As the loss layer goes deeper (increasing `X` in `convX_1`), the texture of the input image becomes more noise-like, leaving only the content of the input (the dancer).

For the purpose of style transfer, we only need to preserve the semantics of the content image. Therefore, we should put the content loss at **higher layers, such as `conv4_1` or `conv5_1`**. One interesting observation is that when the content loss is defined at the deepest layer (`conv5_1`), the optimization fails for several images, producing a loss value of `NaN` (only two of the five images in the assignment, shown above, successfully optimize the content loss at `conv5_1`).

To see the sensitivity to the random initialization, we performed content reconstruction on the same content image twice with different random initializations. We put the content loss only on the `conv4_1` layer and ran the same optimization except for the initialization. Figure [content-recon-random1-output1]~Figure [content-recon-random2-output2] show the inputs and the outputs of the experiments. The overall semantics of the two outputs are the same, but the fine textures do not match exactly (click to see the images in their original size). This result implies that we can combine the content loss with other objective functions to change the style of the input while keeping its semantics the same.

![](images/blank.png width=200px) ![Figure [content-recon-random1-input]: Input `dancing`](images/inputs/content/dancing.jpg width=200px) ![Figure [content-recon-random1-output1]: Output 1](images/random_exp/frida_kahlo_dancing_random_recon1.jpg width=200px) ![Figure [content-recon-random1-output2]: Output 2](images/random_exp/frida_kahlo_dancing_random_recon2.jpg width=200px) ![](images/blank.png width=200px)
![](images/blank.png width=200px) ![Figure [content-recon-random2-input]: Input `wally`](images/inputs/content/wally.jpg width=200px) ![Figure [content-recon-random2-output1]: Output 1](images/random_exp/frida_kahlo_wally_random_recon1.jpg width=200px) ![Figure [content-recon-random2-output2]: Output 2](images/random_exp/frida_kahlo_wally_random_recon2.jpg width=200px) ![](images/blank.png width=200px)

(##) Part 2: Texture Synthesis

Figure [tex-syn1-layer1]~Figure [tex-syn2-layer9] show the results of texture synthesis. Initializing the input image to random noise, we optimize the style loss between the input image and the target image at each single layer and at different combinations of layers. For each style image, the first row shows the results of texture synthesis with a single-layer style loss. When the style loss is at the first layer (`conv1_1`), only small dots are generated. As the loss layer goes deeper, larger-scale textures are synthesized, but the fine texture is lost at the deepest layer and the colors look like those of a noise image. The second row shows the results when the style losses of multiple layers (starting from the first layer) are combined together. The results show various granularities of texture of the style image, all of which look artistic.
![Figure [tex-syn1-layer1]: `conv1_1`](images/texsyn1/frida_kahlo_dancing_texsyn.jpg width=200px) ![Figure [tex-syn1-layer2]: `conv2_1`](images/texsyn2/frida_kahlo_dancing_texsyn.jpg width=200px) ![Figure [tex-syn1-layer3]: `conv3_1`](images/texsyn3/frida_kahlo_dancing_texsyn.jpg width=200px) ![Figure [tex-syn1-layer4]: `conv4_1`](images/texsyn4/frida_kahlo_dancing_texsyn.jpg width=200px) ![Figure [tex-syn1-layer5]: `conv5_1`](images/texsyn5/frida_kahlo_dancing_texsyn.jpg width=200px)
![Figure [tex-syn1-input]: `frida_kahlo`](images/inputs/style/frida_kahlo.jpeg width=200px) ![Figure [tex-syn1-layer6]: `conv1_1`~`conv2_1`](images/texsyn6/frida_kahlo_dancing_texsyn.jpg width=200px) ![Figure [tex-syn1-layer7]: `conv1_1`~`conv3_1`](images/texsyn7/frida_kahlo_dancing_texsyn.jpg width=200px) ![Figure [tex-syn1-layer8]: `conv1_1`~`conv4_1`](images/texsyn8/frida_kahlo_dancing_texsyn.jpg width=200px) ![Figure [tex-syn1-layer9]: `conv1_1`~`conv5_1`](images/texsyn9/frida_kahlo_dancing_texsyn.jpg width=200px)
![Figure [tex-syn2-layer1]: `conv1_1`](images/texsyn1/starry_night_dancing_texsyn.jpg width=200px) ![Figure [tex-syn2-layer2]: `conv2_1`](images/texsyn2/starry_night_dancing_texsyn.jpg width=200px) ![Figure [tex-syn2-layer3]: `conv3_1`](images/texsyn3/starry_night_dancing_texsyn.jpg width=200px) ![Figure [tex-syn2-layer4]: `conv4_1`](images/texsyn4/starry_night_dancing_texsyn.jpg width=200px) ![Figure [tex-syn2-layer5]: `conv5_1`](images/texsyn5/starry_night_dancing_texsyn.jpg width=200px)
![Figure [tex-syn2-input]: `starry_night`](images/inputs/style/starry_night.jpeg width=200px) ![Figure [tex-syn2-layer6]: `conv1_1`~`conv2_1`](images/texsyn6/starry_night_dancing_texsyn.jpg width=200px) ![Figure [tex-syn2-layer7]: `conv1_1`~`conv3_1`](images/texsyn7/starry_night_dancing_texsyn.jpg width=200px) ![Figure [tex-syn2-layer8]: `conv1_1`~`conv4_1`](images/texsyn8/starry_night_dancing_texsyn.jpg width=200px) ![Figure [tex-syn2-layer9]: `conv1_1`~`conv5_1`](images/texsyn9/starry_night_dancing_texsyn.jpg width=200px)

Similarly to Part 1, we perform texture synthesis twice on the same input but with different random initializations. We put the style loss on the first four stages (`conv1_1`~`conv4_1`) and run the optimization identically except for the initialization. Figure [tex-syn-random1-input]~Figure [tex-syn-random5-output2] show the results, with the inputs on the first row, the results from the first noise on the second row, and the results from the second noise on the last row. As we can easily notice, the outputs show the same texture for both initializations, but they do not match in their larger-scale semantics. This result implies that the style loss can work as an objective for texture synthesis while leaving room for the semantics.
![Figure [tex-syn-random1-input]: `escher_sphere`](images/inputs/style/escher_sphere.jpeg style="max-width:200px;max-height:200px;") ![Figure [tex-syn-random2-input]: `frida_kahlo`](images/inputs/style/frida_kahlo.jpeg style="max-width:200px;max-height:200px;") ![Figure [tex-syn-random3-input]: `picasso`](images/inputs/style/picasso.jpg style="max-width:200px;max-height:200px;") ![Figure [tex-syn-random4-input]: `starry_night`](images/inputs/style/starry_night.jpeg style="max-width:200px;max-height:200px;") ![Figure [tex-syn-random5-input]: `the_scream`](images/inputs/style/the_scream.jpeg style="max-width:200px;max-height:200px;")
![Figure [tex-syn-random1-output1]: Output 1](images/random_exp/escher_sphere_juyong_random_texsyn1.jpg width=200px) ![Figure [tex-syn-random2-output1]: Output 1](images/random_exp/frida_kahlo_juyong_random_texsyn1.jpg width=200px) ![Figure [tex-syn-random3-output1]: Output 1](images/random_exp/picasso_juyong_random_texsyn1.jpg width=200px) ![Figure [tex-syn-random4-output1]: Output 1](images/random_exp/starry_night_juyong_random_texsyn1.jpg width=200px) ![Figure [tex-syn-random5-output1]: Output 1](images/random_exp/the_scream_juyong_random_texsyn1.jpg width=200px)
![Figure [tex-syn-random1-output2]: Output 2](images/random_exp/escher_sphere_juyong_random_texsyn2.jpg width=200px) ![Figure [tex-syn-random2-output2]: Output 2](images/random_exp/frida_kahlo_juyong_random_texsyn2.jpg width=200px) ![Figure [tex-syn-random3-output2]: Output 2](images/random_exp/picasso_juyong_random_texsyn2.jpg width=200px) ![Figure [tex-syn-random4-output2]: Output 2](images/random_exp/starry_night_juyong_random_texsyn2.jpg width=200px) ![Figure [tex-syn-random5-output2]: Output 2](images/random_exp/the_scream_juyong_random_texsyn2.jpg width=200px)

(##) Part 3: Neural Style Transfer

To figure out the best way to combine the content loss and the style loss, we test various configurations of content loss layers, style loss layers, and loss weights, as listed in Table [tab:hyperparams]. Since the content loss captures only the high-level semantics of the content image when applied at a higher layer, we tested applying it at the two deepest layers. Also, the style loss captures the style when applied over multiple layers starting from the lowest one, so we considered using the first four or all five layers. For the loss weights, we fixed the content loss weight to 1.0 and changed the relative magnitude of the style loss weight.

 Hyper-param                  | Values
------------------------------|------------------------------------------------
 $L_{\text{content}}$         | {`conv4_1`, `conv5_1`} or {`conv5_1`}
 $\lambda_{\text{content},l}$ | 1.0
 $L_{\text{style}}$           | {`conv1_1`~`conv5_1`} or {`conv1_1`~`conv4_1`}
 $\lambda_{\text{style},l}$   | 1.0, 100.0, or 10000.0
[Table [tab:hyperparams]: Hyper-parameters for neural style transfer]

Also, as suggested in the assignment, we tested two ways of initializing the input image: 1) random initialization and 2) initialization to the content image (a minimal sketch of the two options is shown below).
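In the sketch below, the helper name and the assumption that images are `(1, 3, H, W)` tensors are purely for illustration; the actual initialization in the assignment code may differ in details.

```python
import torch

def initialize_input(content_img, mode="content"):
    """Return the image tensor that the style transfer optimization will update."""
    if mode == "random":
        # 1) Random initialization: noise of the same size as the content image.
        input_img = torch.rand_like(content_img)
    else:
        # 2) Content initialization: start from a copy of the content image.
        input_img = content_img.detach().clone()
    return input_img.requires_grad_(True)
```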
The best hyper-parameters are chosen as follows:

- Content init.: $L_{\text{content}}=${`conv5_1`}, $\lambda_{\text{content},l}=1.0$, $L_{\text{style}}=${`conv1_1`~`conv4_1`}, $\lambda_{\text{style},l}=10^4$
- Random init.: $L_{\text{content}}=${`conv4_1`, `conv5_1`}, $\lambda_{\text{content},l}=1.0$, $L_{\text{style}}=${`conv1_1`~`conv4_1`}, $\lambda_{\text{style},l}=10^2$

Below are the experimental results under the two initialization methods, from which these best configurations were chosen.

(###) Setting 1: Random initialization

Figure [exp11-input]~Figure [exp12-results] show some of the inputs and results of neural style transfer with random initialization. We can see that the style of the style input appears more strongly as more layers are involved in the style loss and as the style loss weight increases. Overall, the best results are obtained when we use **the content loss on both the `conv4_1` and `conv5_1` layers and the style loss on the first four layers with $\lambda_\text{style}=10^2$**.

![Figure [exp11-input]: Style transfer inputs](images/stytrans_exp/frida_kahlo_dancing.jpg height=550px) ![Figure [exp11-results]: The results of style transfer with various hyper-parameters. The best result is marked in red.](images/stytrans_exp/frida_kahlo_dancing_stytrans.jpg height=550px)
![Figure [exp12-input]: Style transfer inputs](images/stytrans_exp/starry_night_tubingen.jpg height=450px) ![Figure [exp12-results]: The results of style transfer with various hyper-parameters. The best result is marked in red.](images/stytrans_exp/starry_night_tubingen_stytrans.jpg height=450px)

(###) Setting 2: Content image initialization

Figure [exp21-input]~Figure [exp22-results] show some of the inputs and results of neural style transfer with the input initialized to the content image. In this case, the best results are obtained when we use **the content loss only on the `conv5_1` layer and the style loss on the first four layers with $\lambda_\text{style}=10^4$**.

![Figure [exp21-input]: Style transfer inputs](images/stytrans_exp/frida_kahlo_dancing.jpg height=550px) ![Figure [exp21-results]: The results of style transfer with various hyper-parameters. The best result is marked in red.](images/stytrans_exp/frida_kahlo_dancing_stytrans2.jpg height=550px)
![Figure [exp22-input]: Style transfer inputs](images/stytrans_exp/starry_night_tubingen.jpg height=450px) ![Figure [exp22-results]: The results of style transfer with various hyper-parameters. The best result is marked in red.](images/stytrans_exp/starry_night_tubingen_stytrans2.jpg height=450px)

Compared to the former setting (random initialization), much less content loss and much more style loss are needed to obtain the best results. This is expected because, at initialization, the input image already contains the content of the content image, as well as its original style. In terms of quality, the two initialization methods show different extents of style transfer: the style image is reflected more strongly in the output when the input is random noise, while the content looks much clearer when the input is the content image; both results look natural, and the choice can be left to the user's preference. In terms of running time, both methods take the same amount of time per step; running all 500 steps of optimization takes about 36 seconds on average.

(##) More Results

Here we present the results of style transfer for some of the style and content images provided in the assignment, as well as my favorite images. We ran the method with content image initialization and used the hyper-parameters chosen in the section above. For the full results, please visit [this page](full_results.html).
![](images/table_header_sc.png height=140px width=140px) ![](images/inputs/content/dancing.jpg height=190px)![](images/inputs/content/tubingen.jpeg height=190px)![](images/inputs/content/wally.jpg height=190px)![](images/inputs/content/fallingwater.png height=190px)
![](images/inputs/style/frida_kahlo.jpeg style="max-width:140px; max-height:190px") ![](images/content_best/frida_kahlo_dancing_stytrans2.jpg height=190px) ![](images/content_best/frida_kahlo_tubingen_stytrans2.jpg height=190px) ![](images/content_best/frida_kahlo_wally_stytrans2.jpg height=190px) ![](images/content_best/frida_kahlo_fallingwater_stytrans2.jpg height=190px)
![](images/inputs/style/starry_night.jpeg style="max-width:140px; max-height:190px") ![](images/content_best/starry_night_dancing_stytrans2.jpg height=190px) ![](images/content_best/starry_night_tubingen_stytrans2.jpg height=190px) ![](images/content_best/starry_night_wally_stytrans2.jpg height=190px) ![](images/content_best/starry_night_fallingwater_stytrans2.jpg height=190px)
![](images/inputs/style/the_scream.jpeg style="max-width:140px; max-height:190px") ![](images/content_best/the_scream_dancing_stytrans2.jpg height=190px) ![](images/content_best/the_scream_tubingen_stytrans2.jpg height=190px) ![](images/content_best/the_scream_wally_stytrans2.jpg height=190px) ![](images/content_best/the_scream_fallingwater_stytrans2.jpg height=190px)

![](images/blank.png height=140px width=140px) ![](images/table_header_cs.png height=140px width=140px) ![](images/inputs/style/escher_sphere.jpeg style="max-width:170px; max-height:140px") ![](images/inputs/style/frida_kahlo.jpeg style="max-width:170px; max-height:140px") ![](images/inputs/style/starry_night.jpeg style="max-width:170px; max-height:140px") ![](images/blank.png height=140px width=140px)
![](images/blank.png height=140px width=140px) ![](images/inputs/content/juyong.jpg width=170px)![](images/content_best/escher_sphere_juyong_stytrans2.jpg width=170px) ![](images/content_best/frida_kahlo_juyong_stytrans2.jpg width=170px) ![](images/content_best/starry_night_juyong_stytrans2.jpg width=170px) ![](images/blank.png height=140px width=140px)
![](images/blank.png height=140px width=140px) ![](images/inputs/content/fall_couple.jpg width=170px)![](images/content_best/escher_sphere_fall_couple_stytrans2.jpg width=170px) ![](images/content_best/frida_kahlo_fall_couple_stytrans2.jpg width=170px) ![](images/content_best/starry_night_fall_couple_stytrans2.jpg width=170px) ![](images/blank.png height=140px width=140px)

(#) Bibliography

[#Gatys15a]: Gatys, Leon A., et al. 2015. Texture Synthesis Using Convolutional Neural Networks. In _Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS '15)_. https://arxiv.org/abs/1505.07376

[#Gatys15b]: Gatys, Leon A., et al. 2015. A Neural Algorithm of Artistic Style. _arXiv preprint_. https://arxiv.org/abs/1508.06576

[#Simonyan14]: Simonyan, Karen, and Zisserman, Andrew. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In _3rd International Conference on Learning Representations (ICLR '15)_. https://arxiv.org/abs/1409.1556