Assignment #4 - Neural Style Transfer

Overview

In this assignment, we experiment with the implementation of neural style transfer, which takes a content image and a style image (representing the artistic style we wish to transfer onto the content), and produces the content image rendered in the style's domain. We use a VGG-19 network pretrained on ImageNet, adding content loss and style loss modules to the end of certain layers. The initial experiments focus on reconstructing the target image from random noise using only the content loss, followed by an experiment reconstructing noise in the style space using only the style loss. We conclude with experiments running the full style transfer on the provided images, as well as some additional selected images.
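
In outline, the setup looks like the following sketch (this assumes a PyTorch/torchvision implementation): the pretrained network is loaded and frozen, since only the input image is optimized.

```python
import torch
from torchvision import models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Convolutional portion of a VGG-19 pretrained on ImageNet; its weights are
# frozen because style transfer optimizes the input image, not the network.
cnn = models.vgg19(pretrained=True).features.to(device).eval()
for param in cnn.parameters():
    param.requires_grad_(False)
```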

Content Reconstruction

For the first experiment, I optimized the content loss alone at a handful of different layers in the pretrained VGG-19 network. The relationship is that the earlier in the network the content loss is placed, the lower the final content loss and the more faithful the reconstruction. Since the content loss is defined as the squared L2 distance between the features of the input image and those of the target content image at a particular layer L, the content loss at L = 0 is simply the squared L2 pixel loss. The deeper into the network the content loss is placed, the more abstract the latent representation, and the more noise is introduced into the resulting image (and accordingly the higher the content loss). Observe how the noise gets more intense as the content loss is optimized at layers 1, 3, and 5, respectively (the content loss values of the resulting images are 0.000, 0.025, and 0.484, respectively):

Preferably, the content loss we optimize would be placed as deep in the network as possible without introducing too much noise, so I set the content loss to be optimized after the 3rd convolution layer. When optimizing with content loss only, variations in the random noise input have little effect on the image that is eventually reconstructed; slight variations can be detected in the loss values output during training (they vary from run to run). Here are reconstructions of the tubingen image from two different random noise inputs.
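
To make the content loss definition concrete, here is a minimal sketch, again assuming PyTorch (mse_loss averages rather than sums the squared differences, which only rescales the reported values):

```python
import torch.nn as nn
import torch.nn.functional as F

class ContentLoss(nn.Module):
    """Pass-through layer that records the squared L2 (MSE) distance between
    the current features and the target content features at this depth."""
    def __init__(self, target):
        super().__init__()
        self.target = target.detach()  # fixed target; no gradients flow into it

    def forward(self, x):
        self.loss = F.mse_loss(x, self.target)
        return x  # forward the features unchanged to the next layer
```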

Texture Synthesis

Here are the style losses after 300 iterations when the style loss is optimized after a single convolution layer (one experiment per layer): 0.000007 (layer 1), 0.004776 (layer 2), 0.306281 (layer 3), 0.596941 (layer 4), and 3.653684 (layer 5). Varying where the style loss is optimized has a drastically larger impact on the resulting image than varying the content loss configuration did. Here is the output for each of those five experiments.
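
For reference, here is a minimal sketch of the style loss used in these experiments, assuming a PyTorch implementation; the Gram matrix is normalized over the feature pixels (as noted in the Style Transfer section below):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gram_matrix(features):
    """Gram matrix of a (batch, channels, height, width) feature map,
    normalized by the total number of feature elements."""
    b, c, h, w = features.size()
    flat = features.view(b * c, h * w)
    gram = flat @ flat.t()
    return gram / (b * c * h * w)

class StyleLoss(nn.Module):
    """Pass-through layer recording the MSE between the Gram matrix of the
    current features and that of the style image's features at this depth."""
    def __init__(self, target_features):
        super().__init__()
        self.target = gram_matrix(target_features).detach()

    def forward(self, x):
        self.loss = F.mse_loss(gram_matrix(x), self.target)
        return x
```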

I also experimented with optimizing the style loss after layers 1, 3, and 5, and here is the result for that experiment (interestingly, it looks the most like optimizing after only layer 3).

Of all the trials, the default configuration, optimizing the style loss after every convolution layer, appeared to give the best result. Here is that result:

Using that configuration (optimizing the style loss after every convolution layer), I then generated two images from two different random noise inputs. Just as varying the style loss configuration has a much larger effect than varying the content loss configuration, the differences in the random noise inputs lead to noticeably different output images. Interestingly, in the first example the leaves are more pronounced, while in the second example the black spots are more pronounced.
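
To tie these configurations together, here is a rough sketch of how the loss modules could be attached after specific convolution layers when rebuilding the network. It reuses the ContentLoss and StyleLoss sketches above, and the layer names (conv_1 through conv_5, i.e. the first five convolutions) are illustrative:

```python
import torch.nn as nn

def build_model(cnn, content_img, style_img,
                content_layers=('conv_3',),
                style_layers=('conv_1', 'conv_2', 'conv_3', 'conv_4', 'conv_5')):
    """Rebuild the VGG feature stack layer by layer, inserting ContentLoss /
    StyleLoss modules (sketched above) after the named convolutions."""
    model = nn.Sequential()
    content_losses, style_losses = [], []
    conv_idx = 0
    for layer in cnn.children():
        if isinstance(layer, nn.Conv2d):
            conv_idx += 1
            name = f'conv_{conv_idx}'
        elif isinstance(layer, nn.ReLU):
            name = f'relu_{conv_idx}'
            layer = nn.ReLU(inplace=False)  # in-place ReLU would corrupt stored targets
        elif isinstance(layer, nn.MaxPool2d):
            name = f'pool_{conv_idx}'
        else:
            name = f'{layer.__class__.__name__.lower()}_{conv_idx}'
        model.add_module(name, layer)
        if name in content_layers:
            cl = ContentLoss(model(content_img))   # target = content features here
            model.add_module(f'content_loss_{conv_idx}', cl)
            content_losses.append(cl)
        if name in style_layers:
            sl = StyleLoss(model(style_img))       # target = style Gram matrix here
            model.add_module(f'style_loss_{conv_idx}', sl)
            style_losses.append(sl)
    # (In practice the model can be truncated after the last loss module.)
    return model, content_losses, style_losses
```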

Style Transfer

In my implementation, I found that the default style weight of 1,000,000 caused the network to focus too much on the style component and not enough on the content. I experimented with several smaller values for the style weight, and found that decreasing it by an order of magnitude to 100,000 worked well while keeping the content weight at 1; these hyperparameters bring the loss values closer in magnitude. I also crop the larger of the two input images to the same size as the smaller image, so that there are no numerical issues in the computation of the loss values and propagation through the network. Lastly, I make sure that the Gram matrix is normalized over the feature pixels.
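
With those weights, the objective being optimized looks roughly like this sketch, where content_losses and style_losses are the per-layer modules collected when the model is built (as in the earlier sketch):

```python
content_weight = 1
style_weight = 1e5  # reduced from the default 1e6 as described above

def total_loss(content_losses, style_losses):
    """Weighted sum of the per-layer losses recorded during the forward pass."""
    c = sum(cl.loss for cl in content_losses)
    s = sum(sl.loss for sl in style_losses)
    return content_weight * c + style_weight * s
```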

Here is a grid that shows the output of stylizing 2 different content images (tubingen and wally) in two different styles (frida kahlo and the scream):

Rows (content) / Columns (style)

I then tried to reproduce the results using random noise as the input. I measured the random noise input to take 54.8 seconds to complete 300 iterations, and the content image input to take 54.3 seconds. Though the resulting images were somewhat different, I found that starting from the content image produced a slightly higher-quality output, in that the application of the style appeared more natural. Specifically, the swirling paint style near the dog's right ear is too pronounced in the image generated from the noise input, and obscures too much of the dog's face, while in the other image the dog's face and body are clearer in the foreground. Additionally, for some of the other experiments, I found that using the content image as the input instead of random noise led to significantly smaller content losses. The image generated from the content image is shown first, and the one generated from noise is shown second.
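
For completeness, here is a sketch of the optimization loop used for this comparison, assuming L-BFGS and the helpers sketched above (model, content_losses, style_losses, and total_loss); the only difference between the two runs is how input_img is initialized:

```python
import torch

# Initialize either from the content image or from random noise.
input_img = content_img.clone()              # content-image initialization
# input_img = torch.rand_like(content_img)   # random-noise initialization

input_img.requires_grad_(True)
optimizer = torch.optim.LBFGS([input_img])

run = [0]
while run[0] <= 300:
    def closure():
        with torch.no_grad():
            input_img.clamp_(0, 1)   # keep pixel values in a valid range
        optimizer.zero_grad()
        model(input_img)             # forward pass records the per-layer losses
        loss = total_loss(content_losses, style_losses)
        loss.backward()
        run[0] += 1
        return loss
    optimizer.step(closure)

with torch.no_grad():
    input_img.clamp_(0, 1)
```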

Lastly, I performed style transfer on a mountainous landscape in the style of Edward Hopper (using his painting Chop Suey). Here are the content image, the style image, and the resulting stylized image.

Extra Credit

For the extra credit portion, I experimented with stylizing one of my images from the Poisson blending assignment. Here is the original image I use as the content image (the Amalfi coast with a sea monster blended in):

And I attempted to stylize it using Edward Hopper's Chop Suey:

Here is the result! (The clouds in particular are strikingly in Hopper's style.)