**CMU 16-726 Learning-based Image Synthesis** **Assignment #4** *Title: "Neural Style Transfer"* *Name: Soyong Shin (soyongs@andrew.cmu.edu)* (##) Contents * Part 1 Introduction and Overview * Part 2 Content Reconstruction * Part 3 Texture Synthesis * Part 4 Neural Transfer (##) Part 1 Introduction and Overview **1.1 Introduction** ![figure [sample_transfer]: Sample image of Neural Style Transfer](report/Figure1.png) In this assignment, I implemented an algorithm that transfers the style of one image onto another while preserving the content of the target image. This method, called "Neural Style Transfer", consists of two parts: content reconstruction and texture synthesis. Content reconstruction builds an image from noise that has content similar to a target image, while texture synthesis generates an image with a style similar to the target. By integrating the two methods, we can build an overall architecture that takes two images (one for the style and the other for the content) and returns a new image combining the content and style of the two inputs (see Figure 1).

------------------------------------------------------------------------------------------------------------------------------------------------------------ **1.2 Neural network (VGG-19)** This algorithm uses the feature extractor of the **VGG-19** network architecture, which consists of 5 convolutional blocks. The outputs of the convolutional blocks are extracted features at different levels of abstraction ($L = {1, 2, 3, 4, 5}$). For reconstructing content and synthesizing style, the algorithm computes content and style losses at the feature level, where the choice of level $L$ is one of the hyper-parameters of the algorithm. ![figure [vgg19_network]: Network Architecture of VGG-19 (Feature Extractor)](report/Figure2.png) Figure 2 shows the network architecture and diagram of the VGG-19 feature extractor. From here on, $F^L$ denotes the feature at level $L$, i.e., the output of the intermediate convolutional block marked as "Level $L$". Note that this VGG-19 feature extractor is pre-trained on the ImageNet dataset.
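As a point of reference, the pre-trained feature extractor can be obtained directly from *torchvision*; the sketch below is a minimal illustration (my code wraps this loading step in its own routine):

``` python
import torch
import torchvision.models as models

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# convolutional part of VGG-19, pre-trained on ImageNet;
# it is only used as a fixed feature extractor, so gradients are frozen
cnn = models.vgg19(pretrained=True).features.to(device).eval()
for param in cnn.parameters():
    param.requires_grad_(False)
```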

------------------------------------------------------------------------------------------------------------------------------------------------------------ **1.3 Code Implementation** To use different levels of features in the experiments, I duplicated the VGG-19 feature extractor while inserting content and style losses at intermediate feature levels. Each convolutional layer (*Conv2D*) comes before an activation function (*ReLU*), and I implemented a function ***get_model_and_losses*** that inserts the losses right after the final convolutional layer of each convolutional block. The code implementation is as below:
def get_model_and_losses(cnn, style_img, content_img,
                         content_layers=content_layers_default,
                         style_layers=style_layers_default):

    """Duplicate the CNN model while inserting losses at the given feature levels.

        Args.
            cnn: VGG-19 feature extractor pre-trained on ImageNet
            style_img: image from which the texture is synthesized
            content_img: image from which the content is reconstructed
            content_layers: levels of features to use for content reconstruction
            style_layers: levels of features to use for texture synthesis
        Return.
            model: duplicated model with losses inserted
            style_losses: loss layers for texture synthesis
            content_losses: loss layers for content reconstruction
    """

    cnn = copy.deepcopy(cnn)
    content_losses = []
    style_losses = []

    normalization = Normalization()
    model = nn.Sequential(normalization)

    block_idx, layer_idx = 1, 1
    end_of_blocks = ['conv_1_2', 'conv_2_2', 'conv_3_4', 'conv_4_4', 'conv_5_4']
    for layer in cnn.children():
        if isinstance(layer, nn.Conv2d):
            # 2D convolutional layer
            name = 'conv_%d_%d' % (block_idx, layer_idx)

        elif isinstance(layer, nn.ReLU):
            # ReLU layer
            name = 'relu_%d_%d' % (block_idx, layer_idx)
            layer = nn.ReLU(inplace=False)

            if 'conv_%d_%d' % (block_idx, layer_idx) in end_of_blocks:
                # when the preceding conv layer is the final layer of its block,
                # we add content and style losses here

                if 'conv_%d' % block_idx in content_layers:
                    # add content loss layer
                    target = model(content_img).detach()
                    content_loss = ContentLoss(target)
                    model.add_module('content_loss_%d' % block_idx, content_loss)
                    content_losses.append(content_loss)

                if 'conv_%d' % block_idx in style_layers:
                    # add style loss layer
                    target = model(style_img).detach()
                    style_loss = StyleLoss(target)
                    model.add_module('style_loss_%d' % block_idx, style_loss)
                    style_losses.append(style_loss)

                if max(content_layers + style_layers) == 'conv_%d' % block_idx:
                    # stop copying once the deepest requested level is reached
                    break

            layer_idx += 1

        elif isinstance(layer, nn.MaxPool2d):
            # pooling layer; move on to the next block
            name = 'pool_%d' % block_idx
            block_idx += 1
            layer_idx = 1

        else:
            raise NameError("unexpected layer appeared!")

        model.add_module(name, layer)

    return model, style_losses, content_losses
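
For reference, a minimal sketch of how this function might be called, assuming *cnn*, *style_img*, and *content_img* are already loaded; the default layer lists shown here are assumptions for illustration, not copied from my script:

``` python
# assumed defaults for illustration; the actual layer sets used in the
# experiments are given in the experimental-setting subsections below
content_layers_default = ['conv_4']
style_layers_default = ['conv_1', 'conv_2', 'conv_3', 'conv_4', 'conv_5']

model, style_losses, content_losses = get_model_and_losses(
    cnn, style_img, content_img,
    content_layers=['conv_3'],                    # content loss at level 3
    style_layers=['conv_1', 'conv_2', 'conv_3'])  # style losses at levels 1-3
```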


(##) Part 2 Content Reconstruction **2.1 Overview** ![figure [content_input1]: Dancer](report/Figure3_1.jpg) ![figure [content_input1]: Falling Water](report/Figure3_2.png) ![figure [content_input1]: Garden](report/Figure3_3.jpeg) ![figure [content_input1]: Village](report/Figure3_4.jpeg) ![figure [content_input1]: Dog](report/Figure3_5.jpg) Each image has its own content. It can be a *beautiful ballerina (Figure 3)*, *marvelous architecture (Figure 4)*, the *Phipps garden (Figure 5)*, a *German village (Figure 6)*, or a *cute dog (Figure 7)*. In this part, I build an algorithm that reconstructs a given image using neural networks. The method is quite simple. I first generate random noise $X$ with the same shape as the given image $C$. Then I optimize the noise using a loss defined on the features at level $L$ ($F^L_X, F^L_C$). The loss is an *MSELoss*: $$Loss = w_c ||F^L_X - F^L_C||^2$$ where $w_c$ is the content weight term. The intuition is that **if images $C$ and $X$ have similar content, their extracted features should be similar as well**.

------------------------------------------------------------------------------------------------------------------------------------------------------------ **2.2 Code implementation** The content loss layer (*ContentLoss*) implemented in ***style_and_content.py*** has a very simple structure. It saves $F^L_C$ as the target value, and the forward pass simply computes the $L2$ distance between $F^L_X$ and this target. To improve the stability of the optimization loop, I compute the loss after normalizing both $F^L_C$ and $F^L_X$ with the target feature mean and standard deviation. The code implementation of *ContentLoss* is as below:
class ContentLoss(nn.Module):

    def __init__(self, target):
        super(ContentLoss, self).__init__()
        # Normalize target and input for each layer
        _target = target.detach()
        self.mean = _target.mean((2, 3), keepdim=True)
        self.std = _target.std((2, 3), keepdim=True)
        self.target = (_target - self.mean) / self.std

    def forward(self, input):
        # Forward pass

        self.loss = F.mse_loss((input - self.mean) / self.std, self.target)

        return input


------------------------------------------------------------------------------------------------------------------------------------------------------------ **2.3 Experimental setting** For the content reconstruction task, I conducted experiments with a single-layer loss placed at different levels $L = 1, ..., 5$. The optimization runs for a maximum of 500 steps using the LBFGS optimizer with a learning rate of $10^{-1}$. I used a content weight of $w_c = 1$.
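The optimization itself follows the standard PyTorch LBFGS closure pattern. Below is a minimal sketch under the settings above; the function and variable names are illustrative, not taken from my script:

``` python
import torch

def reconstruct_content(model, content_losses, input_img,
                        num_steps=500, content_weight=1.0, lr=1e-1):
    # the generated image is the only tensor being optimized
    input_img = input_img.clone().requires_grad_(True)
    optimizer = torch.optim.LBFGS([input_img], lr=lr)

    step = [0]
    while step[0] < num_steps:
        def closure():
            optimizer.zero_grad()
            model(input_img)  # loss layers store their losses during the forward pass
            loss = content_weight * sum(cl.loss for cl in content_losses)
            loss.backward()
            step[0] += 1
            return loss
        optimizer.step(closure)

    return input_img.detach()
```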

------------------------------------------------------------------------------------------------------------------------------------------------------------ **2.4 Results** ![figure [content_result]: Results of content reconstruction at different feature levels](report/Figure4.png) Figure 8 shows the results of the content reconstruction task. Images in the first column are the input images, and each following column shows results from a different feature level. Looking at Level 1, which compares features after two convolutional layers (see Figure 2), the output is highly similar to the input images (practically indistinguishable). However, as the feature level goes deeper, the reconstructed output becomes noisier and preserves less of the content. For very deep levels such as *Level 4* and *Level 5*, the output is barely recognizable, which is not sufficient for our algorithm since we want to preserve the semantic content of the image $X$. At the bottom right corner (Falling Water with Level 5), the output is just black because the optimization loop returns a **NaN value** after a certain step. This happens often when experimenting at higher levels, and it is fixed by using a smaller learning rate; however, I have not yet identified the exact cause of this error. ![figure [content_result]: Results of content reconstruction using two different noises](report/Figure4_2.png) Figure 9 shows content reconstruction results using two different noise initializations (both drawn from a Gaussian distribution) to examine the effect of initialization. For this, I used feature level $L=3$. Although the two results differ slightly, the overall quality of the content reconstruction is essentially identical when the same feature level is used. (##) Part 3 Texture Synthesis **3.1 Overview** ![figure [style_sample1]: Etching](report/Figure5_1.jpeg) ![figure [style_sample2]: Frida Kahlo](report/Figure5_2.jpeg) ![figure [style_sample3]: Picasso](report/Figure5_3.jpg) ![figure [style_sample4]: van Gogh](report/Figure5_4.jpeg) ![figure [style_sample5]: Munch](report/Figure5_5.jpeg) Each image also has its own style, especially paintings. The style of an image can be a drawing technique such as *etching (Figure 10)*, or the style of painters like *Frida Kahlo (Figure 11)*, *Picasso (Figure 12)*, *van Gogh (Figure 13)*, and *Munch (Figure 14)*. We perceive the **style** of an image by looking at its overall appearance rather than any particular region. Thus, style is something that comes from all of the pixels, but it should not contain local spatial information. From this intuition, the texture synthesis loss is computed by comparing **Gram matrices**, i.e., the inner product of a feature map with itself. The Gram matrix at level $L$, $G^L$, is computed as: $$ G^L_{i, j} = \sum_{k} F^L_{ik} F^L_{jk} = F^L (F^L)^T $$ where $F^L$ is the feature map at level $L$ and $k$ runs over all pixels of the feature map. Then, the texture synthesis loss is given by: $$ Loss = w_s ||g^L_S - g^L_X||^2 $$ where $g^L$ is the normalized Gram matrix, i.e., the Gram matrix divided by the number of feature map pixels. Again, $w_s$ is the style weight term.

------------------------------------------------------------------------------------------------------------------------------------------------------------ **3.2 Code implementation** The texture synthesis loss layer *StyleLoss* and the Gram matrix calculator *gram_matrix* are implemented in ***style_and_content.py***. The StyleLoss layer takes the feature of the style target image as input, computes its Gram matrix, and stores it as an attribute. The forward pass then takes the feature of the image being reconstructed as input and compares the Gram matrices of both features. Here I also use normalization for better convergence. The code for gram_matrix and StyleLoss is implemented as below.
def gram_matrix(activations):
    a, b, c, d = activations.size()  # a = batch size (=1)
    features = activations.view(a * b, c * d)
    gram = torch.mm(features, features.T)

    normalized_gram = gram.div(a * b * c * d)

    return normalized_gram


class StyleLoss(nn.Module):

    def __init__(self, target_feature):
        super(StyleLoss, self).__init__()
        # Normalize feature at each layer
        _target = target_feature.detach()
        self.mean = _target.mean((2, 3), keepdim=True)
        self.std = _target.std((2, 3), keepdim=True)
        self.target = gram_matrix((_target - self.mean) / self.std)

    def forward(self, input):
        normalized_gram = gram_matrix((input - self.mean) / self.std)
        self.loss = F.mse_loss(normalized_gram, self.target)

        return input


------------------------------------------------------------------------------------------------------------------------------------------------------------ **3.3 Experimental setting** For the texture synthesis task, I conducted experiments with single- and multiple-layer losses at different levels $L = 1, ..., 5$. For the multiple-layer experiments, I used consecutive layers starting from level 1 up to $L$, as shown in the sketch below. Other hyper-parameters were identical to the content reconstruction experiment except for the weight, $w_s = 10^6$.
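Concretely, the single- and multi-layer settings differ only in the *style_layers* list passed to ***get_model_and_losses***; an illustrative sketch (the empty content list here is only for demonstration):

``` python
# single-layer loss at level L = 3
single_layer = ['conv_3']

# multi-layer losses from level 1 up to L = 3
multi_layers = ['conv_1', 'conv_2', 'conv_3']

model, style_losses, _ = get_model_and_losses(
    cnn, style_img, content_img,
    content_layers=[], style_layers=multi_layers)
```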

------------------------------------------------------------------------------------------------------------------------------------------------------------ **3.4 Results** ![figure [style_result]: Results of texture synthesis at different feature levels](report/Figure6.png) Figure 15 shows the results of texture synthesis using different sets of feature levels $L$. The first column shows the input images from which the algorithm extracts the style, and the following columns show the results. Each column is labeled with a layer number, which stands for the feature level. Each input image has two sub-rows: Single- and Multi-layers. Single-layer means the StyleLoss is activated only for the single layer at level $L$, where $L$ is the layer number. Multi-layers means the StyleLoss is activated for multiple layers from level 1 to $L$. In other words, Layer 4 in the Multi-layers row indicates that the loss is computed for levels $L = 1, 2, 3, 4$. ![figure [style_result]: Results of texture synthesis using two different noises](report/Figure6_2.png) Similar to Figure 9, Figure 16 shows that initialization with two different noises does not significantly affect texture extraction. For this task, I used the multi-layer setting up to level 3, $L = 1, 2, 3$. (##) Part 4. Neural Transfer **4.1 Overview** Transferring style from one image to another can be considered as the combination of **Part 2** and **Part 3**. In other words, this task can be translated into reconstructing the same content as image $C$ while synthesizing the texture of image $S$. Therefore, it simply combines the two losses, as sketched below.
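In code, this amounts to optimizing the weighted sum of both losses inside the same LBFGS closure. A minimal sketch extending the content-reconstruction loop from Part 2 (names are again illustrative, not copied from my script):

``` python
import torch

def transfer_style(model, content_losses, style_losses, input_img,
                   num_steps=400, content_weight=1.0, style_weight=1e6, lr=1e-1):
    input_img = input_img.clone().requires_grad_(True)
    optimizer = torch.optim.LBFGS([input_img], lr=lr)

    step = [0]
    while step[0] < num_steps:
        def closure():
            optimizer.zero_grad()
            model(input_img)
            # total loss is the weighted sum of the two objectives
            loss = content_weight * sum(cl.loss for cl in content_losses) \
                 + style_weight * sum(sl.loss for sl in style_losses)
            loss.backward()
            step[0] += 1
            return loss
        optimizer.step(closure)

    return input_img.detach()
```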

------------------------------------------------------------------------------------------------------------------------------------------------------------ **4.2 Results** *4.2.1. Feature levels* ![figure [style_transfer]: Results of neural style transfer at different feature levels](report/Figure7_1.png) For the experiment analyzing the effect of feature levels, I fixed $num\_steps = 400$, $w_c = 1$, $w_s = 10^6$, and $lr = 10^{-1}$. Figure 17 shows how the style transfer output varies with the feature level of the loss layers. Although there is no absolute criterion or metric for comparing the outputs, I personally prefer the set of layers $L=3$ for content reconstruction and $L=1, 2, 3, 4$ for texture synthesis. With a higher content layer ($L = 4, 5$), the content is not reconstructed well enough: the output is unrecognizable and looks like just the texture of the style input image. With the full set of style layers ($L = 1, ..., 5$), the network extracts an overly fine texture from the input image and does not seem to represent its overall style. Therefore, I conducted all following experiments with $L=3$ for content reconstruction and $L=1, 2, 3, 4$ for texture synthesis.
*4.2.2. Weights* ![figure [style_transfer]: Results of neural style transfer with different weight ratio](report/Figure7_2.png) Figure 18 shows the effect of the weight ratio ($w_s/w_c$) on the style transfer result. As the ratio increases (i.e., the style loss has a higher weight), the output loses the semantic content. On the contrary, as the ratio decreases (i.e., the content loss has a higher weight), the output image has clear content but less style. Therefore, setting an appropriate value is required for a good result. In the figure, ratio values of $10^5$ and $10^6$ look reasonable. Considering that initialization with the content image guarantees better preservation of the content in the output (described in 4.2.3), I selected $w_s/w_c = 10^6$ for the following experiments. *4.2.3. Random noise vs. Content image* ![figure [style_transfer]: Results of neural style transfer with different initializations](report/Figure7_3.png) Figure 19 compares the results of style transfer when the input image is initialized with random Gaussian noise versus a copy of the content image. The results show that when initializing with a random tensor, the final output is more likely to contain the style of the input style image $S$. On the other hand, initializing with the content image $C$ yields an image with clearer content. This result is quite intuitive: when we initialize with the content image, the content loss already starts at its optimum. Therefore, instead of reconstructing the image from scratch, this turns the transfer problem into **adding style to the content image**. Given that our goal is to transfer style, not content, initialization with the content image is preferable. The optimization for 500 steps takes approximately 31 seconds with random noise initialization and 31.5 seconds with content initialization. *4.2.4. Other results* ![figure [style_transfer]: Other results of neural style transfer](report/Figure7_4.png) Following the best setting found in the previous experiments, I ran more style transfers using various input pairs from the given images. Looking at Figure 20, the results look quite good. ![figure [style_transfer]: Other results of neural style transfer](report/Figure7_5.png) I also tried other experiments using my own content images and famous paintings found on Google. The first content image is the Tower of Pisa, which I visited 4 years ago. The second content image is me and my parents, and the last content image is a Ferrari sports car I photographed 5 years ago. I then searched for famous paintings on Google. The first painting is by a Korean artist and has a somewhat oriental style. The next style input is the Renaissance painting 'Scuola di Atene' by Raffaello Sanzio da Urbino, and the final one uses the creative painting technique of Paul Jackson Pollock. In general, neural style transfer works quite well!