Neural Style Transfer

Sudeep Dasari

Andrew ID: sdasari


Overview

In this project, we implement an algorithm that transfers the style of one picture onto the content of another. For example, this page shows the classic Lena picture blended with the style of The Scream by Munch.

Part 1: Content Reconstruction

The baseline VGG neural network architecture is augmented with Content Loss blocks. Each block computes the mean squared error between the latent features at that layer and a target value derived from the input content image. This loss is backpropagated through the network and used to optimize the input noise image (the network weights themselves stay fixed).
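A minimal sketch of one such block, written as a pass-through PyTorch module (the class and variable names here are my own illustration, not necessarily the exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentLoss(nn.Module):
    """Transparent layer: records the MSE against a fixed target, then passes its input through."""
    def __init__(self, target):
        super().__init__()
        self.target = target.detach()  # detach so the target acts as a constant

    def forward(self, x):
        self.loss = F.mse_loss(x, self.target)
        return x  # pass-through, so the block can be spliced anywhere into VGG
```

Because the block returns its input unchanged, it can be inserted after any convolutional layer without altering the network's forward pass.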

1.1 Where to place loss?

I clearly find that placing the content loss at lower convolutional layers creates a more accurate image reconstruction. The higher layers still capture the basic image content (look at the "sketched" dog) but discard the texture information. This property will prove crucial for making style transfer work later.
Wally Image · Conv2 Recon · Conv3 Recon · Conv4 Recon

1.2 Is it Repeatable?

My favorite reconstruction layer is Conv3, since you can see the content but the texture has a cool washed-out effect. I re-create the dancer image twice using different noise initializations. Note that all the important image content is the same across both samples, but the noise patterns change. This is good: it indicates that the conv3 layer pays attention to the important content details while ignoring some less useful texture/noise information.
Dancer Image · Conv3 Recon 1 · Conv3 Recon 2

Part 2: Texture Synthesis

The same procedure from Part 1 is repeated, but with style loss instead of content loss. Style is of course an abstract property, but it can be captured mathematically using the Gram matrix of the input features: take the matrix product of the (flattened) feature maps with their transpose, \(G = F F^T\), and normalize by the input dimensions. The loss is the mean squared error between the features' Gram matrix and the target value.
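A sketch of the Gram-matrix computation and the resulting style loss (this follows the standard formulation described above; the function names are mine):

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """Gram matrix of a (B, C, H, W) feature map, normalized by its dimensions."""
    b, c, h, w = feat.shape
    features = feat.view(b * c, h * w)   # flatten the spatial dimensions
    G = features @ features.t()          # channel-by-channel correlations, F F^T
    return G / (b * c * h * w)           # normalize by the input dimensions

def style_loss(feat, target_gram):
    """MSE between the features' Gram matrix and a fixed target Gram matrix."""
    return F.mse_loss(gram_matrix(feat), target_gram)
```

Note that the Gram matrix throws away all spatial arrangement and keeps only which feature channels co-activate, which is exactly why it captures "style" while ignoring content.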

2.1 Where to place loss?

I find that placing the style loss at higher layers does a better job of capturing larger abstract structure (e.g. swirls in images), whereas the lower layers attend to smaller color and texture splotch details. This is likely due to the larger receptive fields of the higher layers versus the lower ones, combined with the loss of color information deeper in the network. The best reconstruction actually occurs when I optimize the Gram matrix across all of the conv layers: then both the "larger" and "smaller" parts of the style are well captured.

Starry Night Image · Conv1 Recon · Conv4 Recon · All Recon

2.2 Is it Repeatable?

Again I test repeatability: I optimize multiple noise images to match the Frida reference image using the style loss on all convolutional layers. Note that each of the images captures the overall "style," but the specific image structure differs per sample. In particular, there are creepy floating eyes in different places for each noise sample. Again, this is good! It means we are capturing the style we care about while ignoring content.

Frida Image · All Conv Recon 1 · All Conv Recon 2

Part 3: Style Transfer

I combine the style and content losses by weighting and adding them! The result is the final style transfer algorithm.
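Concretely, the combination can be sketched as follows (assuming `style_losses` and `content_losses` are lists of the loss modules from Parts 1 and 2, each exposing a `.loss` field after a forward pass; the weights shown are the ones discussed in 3.1):

```python
# Hypothetical combination, assuming each loss module updates its `.loss`
# field during the forward pass through the augmented VGG network.
def total_loss(style_losses, content_losses,
               style_weight=1e6, content_weight=1.0):
    style_score = sum(sl.loss for sl in style_losses)
    content_score = sum(cl.loss for cl in content_losses)
    return style_weight * style_score + content_weight * content_score
```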

3.1 What Hyperparameters?

I add style loss to all convolutional layers and content loss only to conv4. This proved to be a good balance between the content and style objectives. For the style Gram matrix, I ended up normalizing by dividing by the feature dimensions. As a result, I need to upweight the style loss by a factor of one million (1e6), while the content loss is weighted by just 1. I found that optimizing for too long can run into instability depending on the initialization scheme: with random initialization I optimize for longer (~200 iterations), whereas when initializing with the content image I can run fewer (~50-100).
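The optimization loop itself can be sketched like this (an LBFGS loop over the input image, in the style of the standard PyTorch neural-transfer recipe; the function signature and names are illustrative, and only the weights and iteration counts come from the settings above):

```python
import torch

def run_transfer(model, input_img, style_losses, content_losses,
                 num_steps=100, style_weight=1e6, content_weight=1.0):
    """Optimize the pixels of `input_img`; the network weights stay frozen."""
    input_img.requires_grad_(True)
    optimizer = torch.optim.LBFGS([input_img])
    step = [0]
    while step[0] < num_steps:
        def closure():
            with torch.no_grad():
                input_img.clamp_(0, 1)        # keep pixels in a valid range
            optimizer.zero_grad()
            model(input_img)                  # updates each block's .loss field
            loss = (style_weight * sum(sl.loss for sl in style_losses) +
                    content_weight * sum(cl.loss for cl in content_losses))
            loss.backward()
            step[0] += 1
            return loss
        optimizer.step(closure)
    with torch.no_grad():
        input_img.clamp_(0, 1)
    return input_img.detach()
```

LBFGS requires the closure form because it re-evaluates the loss multiple times per step; the clamp keeps the optimized image displayable throughout.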

3.2 2x2 Grid

Here I style transfer the dancer and dog images using the Frida and Picasso style images. All optimizations start from the content image, since starting from random noise can result in very vague content. You can see that the Frida texture transfers better onto the dancer (it misses whole splotches on the dog), while Picasso transfers well onto both.

3.3 How to Initialize?

With my settings, initializing from the content image works better than initializing from random noise. This could be because my content loss is placed higher in the network: the random initialization does show some content (vague building shapes) but misses a lot of important details, which is consistent with my conv4 reconstructions from Part 1. I believe the tradeoff is worth it, however, since initializing from the content image looks quite nice with these settings.

Content Initialization · Random Initialization

3.4 My own Images

I now apply style transfer to some of my own images. I present Luke Skywalker painted by Picasso, and The Blue Marble painted like Starry Night.


Website template graciously stolen from here