11-747 Learning-based Image Synthesis Manuel Rodriguez Ladron de Guevara
This assignment explores neural style transfer, a technique that treats the content and the style of an image as two separable entities. For example, we can transform our favorite photos into a Van Gogh painting. The algorithm takes in a content image, a style image and an input image (either a copy of the content image or white noise). The input image is optimized to match the two target images in content distance and style distance, respectively.
We follow the two seminal works by Gatys et al. (2015), Gatys_a and Gatys_b. Click here for the official PyTorch tutorial. The assignment has several parts: first optimizing in content space, then in style space, and finally combining the two to perform neural style transfer.
For this part of the assignment, we implement the content-space loss and optimize a random noise image with respect to the content loss only. The content loss is a metric that measures the content distance between two images at a given layer. Denote the $l$-th layer feature of the input image X as $X^l$ and that of the target content image C as $C^l$. The content loss is defined as the squared L2-distance of these two features: $$L_{content}(\overrightarrow{x}, \overrightarrow{c}, l) = \frac{1}{2}\sum_{i,j}(X_{ij}^l - C_{ij}^l)^2$$ We can select at which level within the VGG-19 network we extract the features that represent content. Lower layers retain the detailed pixel information of the image, so content reconstructions from them are almost perfect, whereas higher layers capture increasingly abstract content and lose texture and color fidelity.
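To make the content loss concrete, here is a minimal sketch of a per-layer content loss module in the spirit of the official PyTorch tutorial; the class name and its pass-through design are my own choices, not prescribed by the assignment:

```python
import torch
import torch.nn as nn

class ContentLoss(nn.Module):
    """Squared L2 distance between the current feature map and a fixed
    target feature map, following the formula above (sketch)."""
    def __init__(self, target: torch.Tensor):
        super().__init__()
        # Detach the target so it is treated as a constant.
        self.target = target.detach()
        self.loss = torch.tensor(0.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L_content = 1/2 * sum_{i,j} (X^l_ij - C^l_ij)^2
        self.loss = 0.5 * (x - self.target).pow(2).sum()
        return x  # pass the input through so the module can sit inside VGG-19
```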
Above are the content loss reconstructions from each convolutional layer in VGG-19. We can see that the lower convs maintain the same spatial and texture features as the original, while from the higher conv layers of the 4th block onward, texture and colors get distorted until the reconstruction becomes a ball of noise. There is a drastic leap between conv_5_1 and conv_5_2.
In this section, we look at the effect of capturing the style of an image using a style-space loss. How do we measure the distance between the styles of two images? The Gram matrix is used as the style measurement: it holds the correlations between every pair of feature channels at a given layer. Specifically, denote the $l$-th layer feature of an image, flattened to shape (N, K, H∗W), as $f^l$. Then the Gram matrix is $G = f^l(f^l)^T$, of shape (N, K, K). The idea is that the Gram matrices of the optimized image's features and the target style image's features should be as close as possible.
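In code, the Gram matrix is just a reshape followed by a batched matrix product. A minimal sketch (the function name is mine):

```python
import torch

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Gram matrix of a (N, K, H, W) feature map: flatten the spatial
    dimensions to get (N, K, H*W), then take G = f f^T of shape (N, K, K)."""
    n, k, h, w = feat.shape
    f = feat.view(n, k, h * w)
    return torch.bmm(f, f.transpose(1, 2))
```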
From the paper: to generate a texture that matches the style of a given image, we use gradient descent from a white noise image to find another image that matches the style representation of the original image. This is done by minimizing the mean-squared distance between the entries of the Gram matrices. Let $\overrightarrow{a}$ and $\overrightarrow{x}$ be the original image and the image that is generated, and $A^l$ and $G^l$ their respective style representations in layer $l$. The contribution of that layer to the total loss is: $$E_l = \frac{1}{4N^2_lM^2_l}\sum_{i,j}(G^l_{ij}-A^l_{ij})^2$$ and the total loss is: $$L_{style}(\overrightarrow{a}, \overrightarrow{x}) = \sum^L_{l=0}w_lE_l$$ where $w_l$ are weighting factors controlling the contribution of each layer to the total loss.
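Continuing the sketches above (same imports, plus the `gram_matrix` helper), a per-layer style loss module might look like this, folding the $\frac{1}{4N_l^2M_l^2}$ normalization of $E_l$ into the module:

```python
class StyleLoss(nn.Module):
    """E_l from the paper: squared distance between Gram matrices,
    scaled by 1 / (4 * N_l^2 * M_l^2) (sketch)."""
    def __init__(self, target_feat: torch.Tensor):
        super().__init__()
        self.target_gram = gram_matrix(target_feat).detach()
        _, k, h, w = target_feat.shape
        self.n_l, self.m_l = k, h * w  # N_l channels, M_l = H*W positions
        self.loss = torch.tensor(0.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = gram_matrix(x)
        self.loss = ((g - self.target_gram).pow(2).sum()
                     / (4 * self.n_l ** 2 * self.m_l ** 2))
        return x  # transparent pass-through, as with ContentLoss
```

The total style loss $\sum_l w_l E_l$ is then a weighted sum of the `.loss` fields over the chosen style layers; uniform $w_l$ is a reasonable default.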
Below, we display the battery of images corresponding to style reconstructions when extracting features at each conv layer. We see how from conv_1_1 to conv_3_1 the color is preserved, while from conv_3_2 onward the colors start drifting toward noise. Here, lower conv layers yield a finer granularity of texture, while higher conv layers produce larger strokes.
We can also group several feature layers together and see what effect this has on style. Which texture works best is somewhat subjective at this point, but to me, conv_1_1, conv_1_2 and conv_2_1 together is my favorite combination.
The following images capture the style of the images provided for this homework.
Input image: white noise
Input image: content image
From left to right, the first two images use a lower convolutional layer as the content feature and a larger set of convolutional feature maps to capture the style; these need less style weight than the runs that use a higher content feature map (the 3 rightmost images). This is because the last 3 images use 3 convolutional feature layers to capture style (conv_1_1, conv_1_2 and conv_2_1), which gives the texture a smaller granularity, so the style weight has to be increased considerably. The middle image does not capture the colors of the style image. We also see that in the last 2 images the horizontal lines of the building are more wobbly, perhaps because content is captured at a higher layer (conv_4_2). In general, the difference between using noise or the content image as input is not that great when using low content feature layers. However, when using higher content feature layers, using the content image as input preserves the shape of the content better than starting from random noise (visible in the 3rd and 4th images). Lastly, with such a high style weight, using the content image as input makes the optimization explode.
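For reference, the sketch below shows one way the two losses can be combined in the optimization loop, following the official PyTorch tutorial's pattern. It assumes the `ContentLoss` and `StyleLoss` modules from earlier have already been inserted into a pretrained VGG-19 at the chosen layers; `run_style_transfer` and its default values are illustrative, not the exact settings used here:

```python
def run_style_transfer(model, content_losses, style_losses, input_img,
                       num_steps=300, style_weight=1000.0, content_weight=1.0):
    # input_img is either white noise or a clone of the content image.
    input_img.requires_grad_(True)
    optimizer = torch.optim.LBFGS([input_img])

    step = [0]
    while step[0] < num_steps:
        def closure():
            optimizer.zero_grad()
            model(input_img)  # forward pass fills in each module's .loss
            c = content_weight * sum(cl.loss for cl in content_losses)
            s = style_weight * sum(sl.loss for sl in style_losses)
            total = c + s
            total.backward()
            step[0] += 1
            return total
        optimizer.step(closure)

    return input_img.detach().clamp_(0, 1)
```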
First, let's see the effect of changing the content layer only, keeping the rest of the parameters fixed: style weight 1000, style layers from conv_1_1 to conv_3_1.
From left to right, we use conv_2_2, conv_3_1, conv_3_2 and conv_4_1 as the content feature layer. We see that the higher the content layer, the better the original content image is preserved. We will choose the third image (content feature layer conv_3_2) to continue our ablation study.
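In code, the ablation below boils down to which layer names feed which loss. A hypothetical configuration, using the conv_b_i naming from the figures (the exact sets are illustrative):

```python
content_layer = 'conv_3_2'  # fixed for the rest of the study
style_layer_sets = [        # grown one or more layers at a time
    ['conv_1_1'],
    ['conv_1_1', 'conv_1_2', 'conv_2_1'],
    ['conv_1_1', 'conv_1_2', 'conv_2_1', 'conv_3_1'],
    ['conv_1_1', 'conv_1_2', 'conv_2_1', 'conv_3_1', 'conv_4_1'],
]
```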
Let's fix the content feature layer (conv_3_2) and modify the style feature layer, maintaining style weight at 1000.
And finally, let's see the effect of adding more and more style feature layers.
We can see that low style layers do not properly capture the texture of The Scream; we need to wait until conv_3_1 to see that effect. Style layer conv_4_1 covers the content image completely, so a lower style weight would be better for such high style layers.
In general, this gives us an intuition about the style and content weights we need depending on which convolutional feature maps we use for style and content. Summarizing: higher content convolutional layers tend to preserve the content image better than lower layers, and higher style convolutional layers tend to capture longer strokes loaded with stylistic attributes. Let's see the final images. Below, we vary only the style weight; the top row shows results with white noise as the input image and the bottom row shows results with the content image as the input image.
White noise
Content input
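As a closing sketch, the final grids vary only the style weight and the input image. With the hypothetical `run_style_transfer` from earlier (and `content_img` standing in for the loaded content photo), the sweep could be scripted as follows; the weight values are placeholders, not the exact ones used:

```python
for style_weight in (10.0, 100.0, 1000.0, 10000.0):  # placeholder values
    for input_img in (torch.rand_like(content_img),   # top row: noise input
                      content_img.clone()):           # bottom row: content input
        result = run_style_transfer(model, content_losses, style_losses,
                                    input_img, style_weight=style_weight)
```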