Assignment #4 - Neural Style Transfer

Tarasha Khurana

Andrew ID: tkhurana

Content Loss

I used phipps.jpeg to study the effect of using different convolution layer blocks for optimizing the content loss. The following images show the reconstructed image when the content loss is applied after the VGG conv layers listed in each image's caption.

Ablation

conv_1 + conv_2
conv_3 + conv_4

conv_5 + conv_6 + conv_7
conv_8 + conv_9 + conv_10

conv_11 + conv_12 + conv_13
conv_14 + conv_15 + conv_16

It can be seen that the later layers of VGG do not help with image reconstruction. The most faithful reconstruction is obtained from the layers in the first convolution block, followed by the second, and so on. For the final style transfer task, I chose to optimize the content loss after the two conv layers in the second block (conv_3 and conv_4). This was because placing the content loss layers after the first block resulted in a style transfer that retained more of the content than the given style from the style image. It seems that style transfer needs only a close-to-perfect reconstruction, not a pixel-accurate one.
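The content loss described above can be sketched as a small module inserted after a chosen VGG conv layer. This is an illustrative sketch, not the starter code's implementation; the class and variable names are my own, and the toy tensors stand in for real VGG activations.

```python
import torch
import torch.nn.functional as F

class ContentLoss(torch.nn.Module):
    """Records the MSE between the current feature map and a fixed target.

    Inserted after a chosen VGG conv layer (e.g. conv_3 or conv_4);
    the layer choice is an assumption for illustration.
    """
    def __init__(self, target):
        super().__init__()
        # detach() so the target acts as a constant, not part of the graph
        self.target = target.detach()
        self.loss = torch.zeros(())

    def forward(self, x):
        self.loss = F.mse_loss(x, self.target)
        return x  # pass features through unchanged

# toy stand-in for VGG activations
feats = torch.randn(1, 64, 32, 32)
cl = ContentLoss(feats)
out = cl(feats)  # records loss as a side effect, returns features
```

Because the module passes its input through unchanged, several of these can be chained inside the VGG feature stack and their `loss` attributes summed after a forward pass.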

Results

On running the optimization for fallingwater.jpeg with two different random initializations, I obtained the following results:

first run
second run

Content Image

Compared to the original content image, both reconstructions are quite accurate and match the content image almost exactly, except for some white noise. Across the two runs, this white noise appears at different pixels. This is expected: each run starts from a different random initialization, and hence a different starting point on the loss surface, so each run converges to a different minimum.

Style Loss

I used starry_night.jpg to study the effect of using different convolution layer blocks for optimizing the style loss. The following images show the synthesized texture when the style loss is applied after the VGG conv layers listed in each image's caption.

Ablation

conv_1 + conv_2
conv_3 + conv_4

conv_5 + conv_6 + conv_7
conv_8 + conv_9 + conv_10

conv_11 + conv_12 + conv_13
conv_14 + conv_15 + conv_16

conv_1 + conv_2 + conv_3 + conv_4
Style Image

Compared to the original style image, none of these combinations reproduce the actual style exactly, but the outputs from the first two conv blocks come closest. So for the final style transfer, I use a combination of layers from the first two blocks (conv_1 through conv_4).
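The style loss at each of these layers is computed on Gram matrices of the feature maps rather than on the raw features, so only texture statistics (not spatial layout) are matched. The sketch below is an assumed, minimal version; the normalization constant and function names are illustrative, not the starter code's exact ones.

```python
import torch
import torch.nn.functional as F

def gram_matrix(features):
    """Channel-wise feature correlations, normalized by tensor size."""
    b, c, h, w = features.shape
    f = features.view(b * c, h * w)
    return (f @ f.t()) / (b * c * h * w)

def style_loss(input_feats, target_feats):
    # detach the target Gram matrix so it is treated as a constant
    return F.mse_loss(gram_matrix(input_feats),
                      gram_matrix(target_feats).detach())

# toy stand-ins for activations from one of the chosen conv layers
synth = torch.randn(1, 64, 32, 32, requires_grad=True)
style = torch.randn(1, 64, 32, 32)
loss = style_loss(synth, style)
```

In the final transfer, one such loss is computed at each of conv_1 through conv_4 and the terms are summed.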

Results

Using this, I took two random noise images and optimized them to synthesize the texture from the_scream.jpg. Between the two runs, the generated textures were globally similar but locally different. This was again expected, for the same reason explained above for the content loss optimization.

first run
second run

Style Transfer

Implementation Details

I ablated over different sets of convolutional blocks, as shown in the report above. For these, I tried tuning the weight of the content loss by raising or lowering its order of magnitude by 1 and 2. However, after repeated runs I found that the default hyperparameters in the starter code worked best, so I stuck with them. I did not need much tuning for the Gram matrix. More generally, I kept the same hyperparameters for all pairs of style and content images.
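The weight tuning described above amounts to scaling one term of the combined objective by powers of ten. The numbers below are hypothetical placeholders, not the starter code's actual defaults.

```python
# hypothetical weights and loss values for illustration only;
# the starter code's real defaults are not reproduced here
content_weight, style_weight = 1.0, 1e6  # typically orders of magnitude apart
content_loss, style_loss = 4.0, 2e-6

total_loss = content_weight * content_loss + style_weight * style_loss

# "adding or reducing the order of the weight by 1" corresponds to
# multiplying or dividing content_weight by 10 between runs
```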

Apart from this, I was accidentally updating the network weights when implementing the Gram matrix, but was able to fix this by cloning the activations into a separate variable. It was also important for the style and content images to be the same size so that their features could be concatenated for optimization. To this end, I resized the style image to the resolution of the content image.
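Both fixes can be sketched as follows. This is an assumed reconstruction of the two issues, with stand-in tensors in place of real images and activations: the target must leave the autograd graph (via `detach()`/`clone()`) so gradients do not flow back into the VGG weights, and the style image is resampled to the content image's resolution.

```python
import torch
import torch.nn.functional as F

def gram_matrix(features):
    b, c, h, w = features.shape
    f = features.view(b * c, h * w)
    return (f @ f.t()) / (b * c * h * w)

# stand-ins for the real images (names are illustrative)
content_img = torch.rand(1, 3, 128, 96)
style_img = torch.rand(1, 3, 256, 256)

# resize the style image to the content image's resolution so that
# features from both can be compared at the same spatial size
style_img = F.interpolate(style_img, size=content_img.shape[-2:],
                          mode="bilinear", align_corners=False)

# the bug: using the raw Gram matrix as the target keeps it in the
# autograd graph, so backprop also updates the network weights
style_feats = torch.randn(1, 64, 32, 32, requires_grad=True)
target = gram_matrix(style_feats).detach()  # fixed: now a constant
```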

I did this assignment on my lab's cluster with a GeForce GTX 1080 Ti.

Results

Content vs. Random Initialization

Initialization with content image fallingwater.jpeg
Initialization with random noise

In terms of quality, the optimization initialized with the content image results in a better style transfer than the one initialized with a random noise image. Both take about 23 s for me. In the latter, more style is visible than content, which suggests that the content loss should be weighted slightly higher than the style loss. Doing so should alleviate the issue and make style transfer from a noise image comparable in quality.
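The two initialization schemes compared above differ only in how the optimized image is created; in both cases only the image pixels are optimized while the network weights stay frozen. A minimal sketch, with a stand-in tensor for the real content image:

```python
import torch

content_img = torch.rand(1, 3, 64, 64)  # stand-in for the real image

# option 1: initialize from the content image (better quality in my runs)
x = content_img.clone().requires_grad_(True)

# option 2: initialize from random noise (more style, less content)
# x = torch.randn_like(content_img).requires_grad_(True)

# only the image itself is passed to the optimizer, so the
# (frozen) VGG weights are never updated
optimizer = torch.optim.LBFGS([x])
```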

Style transfer on miscellaneous pictures

Style from frida_kahlo.jpeg on spring flowers
Style from starry_night.jpeg on my boyfriend and me
Style from picasso.jpeg on spring flowers