Neural Style Transfer

11-747 Learning-based Image Synthesis Manuel Rodriguez Ladron de Guevara


The Scream Phipps, The Renaissance Phipps, The Kahlo Phipps

Overview

This assignment explores neural style transfer, a technique that treats the content and style of an image as two separate entities. For example, we can transform our favorite photos into Van Gogh paintings. The algorithm takes in a content image, a style image, and an input image (either an existing image or white noise). The input image is optimized to minimize its distance to the content image in content space and to the style image in style space.

We follow the two seminal works by Gatys et al. (2015), Gatys_a and Gatys_b. Click here for the official PyTorch tutorial. The assignment has several parts: first optimizing in content space, then in style space, and finally combining the two to perform neural style transfer.

Content Reconstruction

For this part of the assignment, we implement the content-space loss and optimize a random-noise image with respect to the content loss only. The content loss is a metric that measures the content distance between two images at a given layer. Denote the $l$-th layer features of the input image X as $X^l$ and those of the target content image C as $C^l$. The content loss is defined as the squared L2 distance between these two feature maps: $$L_{content}(\overrightarrow{x}, \overrightarrow{c}, l) = \frac{1}{2}\sum_{i,j}(X_{ij}^l - C_{ij}^l)^2$$ We can select the level within the VGG-19 network at which we extract features to represent content. Reconstructions from lower layers are nearly perfect, while higher layers capture increasingly abstract content at the expense of fine detail.
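As a rough illustration, here is a minimal PyTorch sketch of this loss, written as a transparent module that can be inserted after a chosen VGG-19 conv layer in the style of the official PyTorch tutorial (the module name and placement are assumptions, not our exact code):

```python
import torch
import torch.nn as nn

class ContentLoss(nn.Module):
    """Records the content loss at its position in the network.

    `target` holds the content image's features at the chosen layer
    (e.g. conv_4_2); it is detached so gradients only flow into the
    optimized input image.
    """
    def __init__(self, target):
        super().__init__()
        self.target = target.detach()

    def forward(self, x):
        # 0.5 * squared L2 distance, matching the formula above
        self.loss = 0.5 * torch.sum((x - self.target) ** 2)
        return x  # pass the features through unchanged
```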

Figure: content reconstructions, left to right: original image, conv_1_1 through conv_5_4, and all layers combined.

Above are the content-loss reconstructions for every convolutional layer in VGG-19. The lower convs maintain the same spatial and texture features as the original, while from the higher conv layers of the 4th block onward, texture and colors become progressively distorted until the result is a ball of noise. There is a drastic leap between conv_5_1 and conv_5_2.

Texture Synthesis

In this section, we examine the effect of capturing the style of an image using a style-space loss. How do we measure the distance between the styles of two images? The Gram matrix is used as the style measurement: it contains the correlations between the feature channels at a given layer. Specifically, denote the $l$-th layer features of an image as $f^l$, reshaped to $(N, K, H \cdot W)$ where $K$ is the number of channels. The Gram matrix is then $G^l = f^l (f^l)^T$, of shape $(N, K, K)$. The idea is that the Gram matrices of the optimized image's features and of the target style image's features should be as close as possible.
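A minimal sketch of this computation in PyTorch (the function name is ours; normalization is deferred to the loss below, matching the paper's formulation):

```python
import torch

def gram_matrix(feat):
    """Channel-wise Gram matrix of a feature map.

    feat: (N, K, H, W) features from one VGG-19 layer, flattened to
    (N, K, H*W) and multiplied by its own transpose, giving (N, K, K).
    """
    n, k, h, w = feat.size()
    f = feat.view(n, k, h * w)              # flatten the spatial dimensions
    return torch.bmm(f, f.transpose(1, 2))  # batched matrix product: (N, K, K)
```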

From the paper: to generate a texture that matches the style of a given image, we use gradient descent from a white-noise image to find another image that matches the style representation of the original image. This is done by minimizing the mean-squared distance between the entries of the Gram matrices. Let $\overrightarrow{a}$ and $\overrightarrow{x}$ be the original image and the generated image, and $A^l$ and $G^l$ their respective style representations in layer $l$. The contribution of that layer to the total loss is: $$E_l = \frac{1}{4N^2_lM^2_l}\sum_{i,j}(G^l_{ij}-A^l_{ij})^2$$ and the total loss is: $$L_{style}(\overrightarrow{a}, \overrightarrow{x}) = \sum^L_{l=0}w_lE_l$$ where $w_l$ are weighting factors for the contribution of each layer to the total loss.
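Using the `gram_matrix` sketch above, the per-layer term $E_l$ and the weighted total could look like the following. This is a sketch, not our exact code: it assumes uniform layer weights $w_l = 1/L$ by default, and takes $N_l$ to be the channel count $K$ and $M_l = H \cdot W$.

```python
import torch

def style_loss(input_feats, target_feats, weights=None):
    """Weighted sum of per-layer Gram losses, following E_l above.

    input_feats / target_feats: lists of (N, K, H, W) feature maps from
    the chosen style layers of the optimized and style images.
    """
    L = len(input_feats)
    weights = weights if weights is not None else [1.0 / L] * L  # uniform w_l
    total = 0.0
    for w_l, x, a in zip(weights, input_feats, target_feats):
        n, k, h, w = x.size()
        G = gram_matrix(x)           # Gram matrix of the optimized image
        A = gram_matrix(a).detach()  # fixed target Gram matrix
        E_l = torch.sum((G - A) ** 2) / (4 * k**2 * (h * w)**2)
        total = total + w_l * E_l
    return total
```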

Below, we display the set of images corresponding to the style reconstruction when extracting features at each conv layer. We see that from conv_1_1 to conv_3_1 the color is preserved; from conv_3_2 onward, the colors start drifting toward noise. In this case, lower conv layers seem to yield a finer granularity of texture, while higher conv layers seem to be composed of larger strokes.

Figure: texture syntheses, left to right: original image, conv_1_1 through conv_5_4, and all layers combined.

We can also group several feature layers together and see what effect this has on the style. Which texture works best is somewhat subjective at this point; to me, the combination of conv_1_1, conv_1_2, and conv_2_1 is the favorite.

Figure, left to right: original image; conv_1_1; conv_1_1 + conv_1_2; conv_1_1 + conv_1_2 + conv_2_1; conv_1_1 + conv_1_2 + conv_2_1 + conv_2_2 + conv_3_1.

The following images capture the style of the images provided for this homework.

Figure: for each of the three provided style images, two textures synthesized from different noise seeds using conv_1_1, conv_1_2, and conv_2_1.

Style Transfer

Ablation on Frida Kahlo and the Fallingwater house (Frank Lloyd Wright)

We are now ready to put everything together and do some cool style transfers. The system is somewhat sensitive to the hyperparameters, so a lot of testing was needed to get the best results. There is no single configuration that works for all images; we tweak the hyperparameters depending on the content and style images. To speed up computation, we set the number of iterations to 300 unless otherwise specified. All images use a content weight of 1; the style weight varies.
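For reference, here is a rough sketch of the overall optimization loop, mirroring the official PyTorch tutorial rather than reproducing our exact code. `model` is assumed to be a truncated VGG-19 with `ContentLoss` modules (as above) and analogous Gram-based `StyleLoss` modules inserted at the chosen layers, each recording its loss during the forward pass.

```python
import torch

def run_style_transfer(model, content_losses, style_losses, input_img,
                       content_weight=1.0, style_weight=1e4, num_steps=300):
    """Optimize the input image under a weighted content + style objective.

    content_losses / style_losses: the loss modules inserted into `model`.
    L-BFGS is the optimizer used in the tutorial; 300 steps matches the
    iteration budget used for the results below.
    """
    input_img = input_img.detach().clone().requires_grad_(True)
    optimizer = torch.optim.LBFGS([input_img])
    step = [0]  # mutable counter, visible inside the closure

    while step[0] < num_steps:
        def closure():
            with torch.no_grad():
                input_img.clamp_(0, 1)  # keep pixel values in a valid range
            optimizer.zero_grad()
            model(input_img)  # forward pass populates each module's .loss
            c_loss = sum(cl.loss for cl in content_losses)
            s_loss = sum(sl.loss for sl in style_losses)
            loss = content_weight * c_loss + style_weight * s_loss
            loss.backward()
            step[0] += 1
            return loss
        optimizer.step(closure)

    with torch.no_grad():
        input_img.clamp_(0, 1)
    return input_img.detach()
```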
Figure: style image (Frida Kahlo) and content image (Fallingwater house).

Input image: white noise

Figure, left to right: $\lambda=10^3$ and $\lambda=10^4$ with style convs [1_1 to 3_1] and content conv 2_2; $\lambda=10^7$, $\lambda=10^9$, and $\lambda=10^{10}$ with style convs [1_1 to 2_1] and content conv 4_2.

Input image: content image

Figure, left to right: $\lambda=10^3$ and $\lambda=10^4$ with style convs [1_1 to 3_1] and content conv 2_2; $\lambda=10^7$, $\lambda=10^9$, and $\lambda=10^{10}$ with style convs [1_1 to 2_1] and content conv 4_2.

From left to right, the first two images use a lower convolutional layer (conv_2_2) for content and a larger set of convolutional feature maps (conv_1_1 to conv_3_1) to capture style; they need much less style weight than the three rightmost images. The last three images capture style with only three convolutional feature layers (1_1, 1_2, and 2_1), which gives the texture a finer granularity, so the style weight must be increased by several orders of magnitude. The middle image does not capture the colors of the style image. We also see that in the last two images the horizontal lines of the building are more wobbly, presumably because content is captured at a higher layer (4_2). Generally, the difference between using noise or the content image as input is not that great when using low content feature layers. However, with higher content feature layers, using the content image as input preserves the shape of the content better than random noise does (compare the 3rd and 4th images). Lastly, with the content image as input, the optimization explodes at the highest style weight.

Ablation on The Scream and Phipps Conservatory

Figure: style image (The Scream) and content image (Phipps Conservatory).

First, let's see the effect of changing the content layer only, keeping the rest of the parameters fixed: style weight 1000, style layers from conv_1_1 to conv_3_1.

Figure, left to right: content layer conv_2_2, conv_3_1, conv_3_2, conv_4_1.

From left to right, we use conv_2_2, conv_3_1, conv_3_2, and conv_4_1 as the content feature layer. We see that the higher the content layer, the better the original content image is preserved. We choose the third image (content feature layer conv_3_2) to continue our ablation study.

Next, let's fix the content feature layer (conv_3_2) and vary the style feature layer, keeping the style weight at 1000.

Figure, left to right: style layer conv_1_1, conv_2_1, conv_3_1, conv_4_1.

And finally, let's see the effect of adding progressively more style feature layers.

Figure, left to right: style layers conv_1_1 + 1_2; conv_1_1 + 1_2 + 2_1; conv_1_1 + 1_2 + 2_1 + 2_2; conv_1_1 + 1_2 + 2_1 + 2_2 + 3_1 + 3_2.

We can see that the low style layers do not properly capture the texture of The Scream; the effect only appears once we reach conv_3_1. Style layer conv_4_1 covers the content image completely, so a lower style weight would be better for such high style layers.

In general, this gives us an intuition about the style and content weights we need depending on which convolutional feature maps we use for style and content. Summarizing: higher content convolutional layers tend to preserve the content image better than lower layers, and higher style convolutional layers tend to capture longer strokes loaded with stylistic attributes. In the final images below, we only vary the style weight. The top row shows results with white noise as the input image and the bottom row shows results with the content image as the input image.

White noise

Figure, left to right: style weight 100, 1000, and 10000. Style layers: conv_1_1 to 3_1. Content layer: conv_3_2.

Content input

Figure, left to right: style weight 1000, 10000, and 100000. Style layers: conv_1_1 to 3_1. Content layer: conv_3_2.

Grid of style / content images