Seah Shao Xuan

Master of Science in Machine Learning, 2023

Neural Style Transfer

This project aims to identify the layers of a pretrained VGG19 model that best represent the style and content of an image, and uses them to perform style transfer by minimizing, at the identified layers, the losses between an input image and a pair of style and content images.

Image Reconstruction

To reconstruct an image from pure noise, both the noise image and the content image are run through the model up to the identified content layer, where the loss between the two sets of feature maps is computed.

The image can then be reconstructed by performing gradient descent on the pixels of the input image to minimize this loss.
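As a rough sketch, this reconstruction loop might look like the following in PyTorch; the layer index, image size, and optimizer settings are illustrative assumptions rather than the exact configuration used here.

```python
# Minimal content-reconstruction sketch in PyTorch; hyperparameters are
# illustrative, not the exact settings used in this project.
import torch
import torch.nn.functional as F
import torchvision.models as models

device = "cuda" if torch.cuda.is_available() else "cpu"
vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.to(device).eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def features_at(x, layer_idx):
    # Run x through VGG19 up to and including the chosen layer.
    for i, module in enumerate(vgg):
        x = module(x)
        if i == layer_idx:
            return x

# Placeholder for a preprocessed content image.
content = torch.rand(1, 3, 224, 224, device=device)
target = features_at(content, layer_idx=0).detach()  # index 0 = first conv layer

x = torch.rand_like(content).requires_grad_(True)  # start from uniform noise
optimizer = torch.optim.Adam([x], lr=0.01)
for step in range(500):
    optimizer.zero_grad()
    loss = F.mse_loss(features_at(x, layer_idx=0), target)
    loss.backward()
    optimizer.step()
```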

Each Conv2D layer was tested independently, and the results of each reconstruction attempt are shown below.



Dancing

[Reconstructions of the Dancing image from each of the 16 convolutional layers]

Fallingwater

[Reconstructions of the Fallingwater image from each of the 16 convolutional layers]

The best results are achieved when the content reconstruction layer is placed nearer to the input (early layers).

Therefore, when identifying layers for minimizing content loss, these layers should be considered first.

Image Reconstruction from Different Noise Inputs

Based on the above results, one strong candidate layer for optimizing the reconstruction loss is the first convolutional layer (my personal favourite).

Therefore, two random noise inputs were initialized from a uniform distribution and optimized toward the same content image using the loss at the first layer.

Despite being initialized from different noise images, the resulting images were relatively similar. Both were also close to the original image fed into the network.

Texture Synthesis

Next, image textures are synthesized via a similar method. However, instead of directly using the reconstruction loss between two images at a particular layer, the Gram matrix of the feature maps is computed first; it captures the correlations between pairs of feature channels at that layer.

At each identified layer, the feature maps were converted into Gram matrices, and the MSE loss was computed between the Gram matrices of the input and style images.
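A minimal sketch of the Gram matrix computation and the resulting style loss follows; the normalization by the feature map size is one common convention and an assumption here.

```python
# Gram-matrix style loss sketch; assumes feature maps of shape (1, C, H, W).
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    _, c, h, w = feat.shape
    f = feat.view(c, h * w)
    # (C x C) matrix of channel-to-channel correlations; normalizing by the
    # feature map size is one common convention, assumed here.
    return (f @ f.t()) / (c * h * w)

def style_loss(input_feats, style_feats):
    # Sum the MSE between Gram matrices over the chosen style layers.
    return sum(
        F.mse_loss(gram_matrix(a), gram_matrix(b))
        for a, b in zip(input_feats, style_feats)
    )
```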

Some candidate layer sets were identified for texture synthesis: the first five convolutional layers (1, 2, 3, 4, 5), the first layer of each block (1, 3, 5, 9, 13), and the last layer of each block (2, 4, 8, 12, 16).



Picasso

[Picasso textures synthesized with layers 1-5, layers 1, 3, 5, 9, 13, and layers 2, 4, 8, 12, 16]

Starry Night

[Starry Night textures synthesized with layers 1-5, layers 1, 3, 5, 9, 13, and layers 2, 4, 8, 12, 16]

While all of the produced images capture the essence of the style image, those synthesized using the first or last layer of each block retained more of the content structure.

To preserve only the style from the style image, the optimal set of layers for style loss minimization is therefore layers 1 through 5.

Texture Synthesis from Different Noise Inputs

Using the above results, the same experiment was repeated with two different random noise initializations.

Two random noise inputs were initialized from a uniform distribution, and optimized to the same style image using the first five layers.

Both generated images were roughly representative of the style of the input image. However, the differences between both generated images are more apparent.

This shows that the synthesized style is more sensitive to the initialization than the reconstructed content is.

Style Transfer

With the ability to reconstruct images and synthesize textures, a natural next step is to perform style transfer onto an image.

Hyperparameter Tuning

Since the input image interacts with the style and content images mainly through the losses minimized at the identified layers, the choice of layers is very important.

Heuristically, using earlier layers reproduces the content more faithfully in the resulting image, while several layer sets (the first five layers, the first layer of each block, or the last layer of each block) are all good candidates for injecting style.
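These candidate choices can be written down compactly. The sketch below is illustrative: the style sets mirror those tested for texture synthesis, the content-layer candidates are an assumption, and the content coefficient is fixed at one with only the style coefficient left to tune.

```python
# Candidate layer choices for style transfer; the exact content-layer
# candidates are an assumption, and the style sets mirror those tested above.
CONTENT_LAYER_CANDIDATES = [1, 2, 3, 4]     # early conv layers (assumed set)
STYLE_LAYER_SETS = {
    "first_five":     [1, 2, 3, 4, 5],      # first five conv layers
    "first_in_block": [1, 3, 5, 9, 13],     # first conv layer of each block
    "last_in_block":  [2, 4, 8, 12, 16],    # last conv layer of each block
}

def total_loss(content_term, style_term, style_weight=1.0):
    # The content coefficient is fixed at 1; only the style coefficient is tuned.
    return content_term + style_weight * style_term
```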

To find out the effect of choosing different style and content layers, a content and a style image were chosen for style transfer.

[Images generated with each combination of content layer and style layer set]

Of these 12 generated images, the highest-quality result was produced with content layer 4 and the first five convolutional layers as the style layers.

These will be fixed and used from here on.

Optimizing Style Weights

Setting the coefficient on the content loss to one, we can vary the coefficient on the style loss and observe the quality of the resultant images.
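The sweep itself is straightforward; in the sketch below, `run_style_transfer` is a hypothetical stand-in for the full optimization loop, and only the 100,000 and 10,000,000 settings correspond to weights reported here.

```python
# Style-weight sweep sketch; run_style_transfer is a hypothetical helper,
# and the intermediate weight values are illustrative.
def run_style_transfer(content_img, style_img, style_weight):
    ...  # optimize the input image under content_term + style_weight * style_term

content_img, style_img = ..., ...  # placeholders for preprocessed image tensors
for style_weight in [1e5, 1e6, 1e7, 1e8]:
    result = run_style_transfer(content_img, style_img, style_weight)
```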

[Images generated with each tested style weight]

The image trained with a style weight of 100,000 failed to converge.

Based on image quality, a style weight of 10,000,000 was observed to be optimal.

Therefore, all subsequent images are generated with the style weight set to 10,000,000.

General Results

With the settings optimized earlier, each provided content image was style-transferred using each provided style image.

Content Images

Style Images

Results

Starting from Content vs. Starting from Random Noise

When performing style transfer, the input image can either start from random noise, or the content image itself.

The most obvious benefit of starting from the content image is that fewer modifications need to be made to the input, since the output should have a similar structure to the content image.

However, starting from the content image could result in a strong content signal from the start, which may lead to weak style transfer.
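The two starting points differ only in how the optimized tensor is initialized; a minimal sketch, assuming a preprocessed content image tensor:

```python
import torch

content_img = torch.rand(1, 3, 224, 224)  # placeholder for a preprocessed content image

x_from_noise = torch.rand_like(content_img).requires_grad_(True)  # start from random noise
x_from_content = content_img.clone().requires_grad_(True)         # start from the content image
```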

From this example, there are only minor differences between the generated images.

However, upon very close inspection, it can be seen that the image generated from the content image has finer details than the image generated from random noise.

Furthermore, the run times for the noise-initialized and content-initialized inputs were 335.37 seconds and 336.63 seconds respectively, which are identical for all practical purposes.

More Style Transfers

Style Images

Bells and Whistles

Styling Grumpy Cats

Video Style Transfer

In order to perform style transfer on a video, the video was first decomposed into its component frames.

Each frame was then used as the target content image for style transfer.

The resultant frames were then combined back into a video.
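A sketch of this pipeline using OpenCV is shown below; `stylize` is a hypothetical stand-in for the single-image style transfer described above, and the file names are placeholders.

```python
# Frame-by-frame video style transfer sketch using OpenCV.
import cv2

def stylize(frame):
    return frame  # placeholder for the per-frame style transfer

cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("stylized.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    out.write(stylize(frame))  # style-transfer each frame independently

cap.release()
out.release()
```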

Style
