Seah Shao Xuan

Master of Science in Machine Learning, 2023

Neural Style Transfer

This project aims to identify the layers of a pretrained VGG19 model that best represent the style and content of an image, and uses them to perform style transfer by minimizing, at the identified layers, the losses between an input image and a pair of style and content images.

Image Reconstruction

To reconstruct an image from pure noise, both the noise image and the content image are run through the model up to the identified content layer, where the loss between the two sets of feature maps is computed.

The image can then be reconstructed by performing gradient descent on the pixels of the input image to minimize this loss.
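As a rough sketch, this reconstruction loop might look like the following in PyTorch; the layer index, image size, and optimizer settings are illustrative assumptions rather than the exact configuration used here.

```python
# Minimal content-reconstruction sketch in PyTorch; hyperparameters are
# illustrative, not the exact settings used in this project.
import torch
import torch.nn.functional as F
import torchvision.models as models

device = "cuda" if torch.cuda.is_available() else "cpu"
vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.to(device).eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def features_at(x, layer_idx):
    # Run x through VGG19 up to and including the chosen layer.
    for i, module in enumerate(vgg):
        x = module(x)
        if i == layer_idx:
            return x

# Placeholder for a preprocessed content image.
content = torch.rand(1, 3, 224, 224, device=device)
target = features_at(content, layer_idx=0).detach()  # index 0 = first conv layer

x = torch.rand_like(content).requires_grad_(True)  # start from uniform noise
optimizer = torch.optim.Adam([x], lr=0.01)
for step in range(500):
    optimizer.zero_grad()
    loss = F.mse_loss(features_at(x, layer_idx=0), target)
    loss.backward()
    optimizer.step()
```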

Each Conv2D layer was tested independently, and the results of each reconstruction attempt are shown below.



Dancing

[Reconstructions of the Dancing image from each of the 16 convolutional layers]

Fallingwater

[Reconstructions of the Fallingwater image from each of the 16 convolutional layers]

The best results are achieved when the content reconstruction layer is placed nearer to the input (early layers).

Therefore, when identifying layers for minimizing content loss, these layers should be considered first.

Image Reconstruction from Different Noise Inputs

Based on the above results, one strong candidate layer for optimizing the reconstruction loss is the first convolutional layer (my personal favourite).

Therefore, two random noise inputs were initialized from a uniform distribution and optimized toward the same content image using the loss at the first layer.

Despite being initialized from different noise images, the resulting images were relatively similar. Both were also close to the original image fed into the network.

Texture Synthesis

Next, image textures are synthesized via a similar method. However, instead of directly using the reconstruction loss between two images at a particular layer, the Gram matrix of the feature maps is computed first; it captures the correlations between pairs of feature channels at that layer.

At each identified layer, the feature maps were converted into Gram matrices, and the MSE loss was computed between the Gram matrices of the input and style images.
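A minimal sketch of the Gram matrix computation and the resulting style loss follows; the normalization by the feature map size is one common convention and an assumption here.

```python
# Gram-matrix style loss sketch; assumes feature maps of shape (1, C, H, W).
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    _, c, h, w = feat.shape
    f = feat.view(c, h * w)
    # (C x C) matrix of channel-to-channel correlations; normalizing by the
    # feature map size is one common convention, assumed here.
    return (f @ f.t()) / (c * h * w)

def style_loss(input_feats, style_feats):
    # Sum the MSE between Gram matrices over the chosen style layers.
    return sum(
        F.mse_loss(gram_matrix(a), gram_matrix(b))
        for a, b in zip(input_feats, style_feats)
    )
```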

Some candidate layer sets were identified for texture synthesis: the first five convolutional layers (1, 2, 3, 4, 5), the first layer of each block (1, 3, 5, 9, 13), and the last layer of each block (2, 4, 8, 12, 16).



Picasso

[Picasso textures synthesized with layers 1-5, layers 1, 3, 5, 9, 13, and layers 2, 4, 8, 12, 16]

Starry Night

[Starry Night textures synthesized with layers 1-5, layers 1, 3, 5, 9, 13, and layers 2, 4, 8, 12, 16]

While all of the produced images capture the essence of the style image, those synthesized using the first or last layer of each block retained more of the content structure.

To preserve only the style from the style image, the optimal set of layers for style loss minimization is therefore layers 1 through 5.

Texture Synthesis from Different Noise Inputs

Using the above results, the same experiment was repeated with two different random noise initializations.

Two random noise inputs were initialized from a uniform distribution, and optimized to the same style image using the first five layers.

Both generated images were roughly representative of the style of the input image. However, the differences between both generated images are more apparent.

This shows that the synthesized style is more sensitive to the initialization than the reconstructed content is.

Style Transfer

With the ability to reconstruct images and synthesize textures, a natural next step is to perform style transfer onto an image.

Hyperparameter Tuning

Since the input image interacts with the style and content images mainly through the losses minimized at the identified layers, the choice of layers is very important.

Heuristically, using earlier layers reproduces the content more faithfully in the resulting image, while several layer sets (the first five layers, the first layer of each block, or the last layer of each block) are all good candidates for injecting style.
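These candidate choices can be written down compactly. The sketch below is illustrative: the style sets mirror those tested for texture synthesis, the content-layer candidates are an assumption, and the content coefficient is fixed at one with only the style coefficient left to tune.

```python
# Candidate layer choices for style transfer; the exact content-layer
# candidates are an assumption, and the style sets mirror those tested above.
CONTENT_LAYER_CANDIDATES = [1, 2, 3, 4]     # early conv layers (assumed set)
STYLE_LAYER_SETS = {
    "first_five":     [1, 2, 3, 4, 5],      # first five conv layers
    "first_in_block": [1, 3, 5, 9, 13],     # first conv layer of each block
    "last_in_block":  [2, 4, 8, 12, 16],    # last conv layer of each block
}

def total_loss(content_term, style_term, style_weight=1.0):
    # The content coefficient is fixed at 1; only the style coefficient is tuned.
    return content_term + style_weight * style_term
```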

To find out the effect of choosing different style and content layers, a content and a style image were chosen for style transfer.

[Images generated with each combination of content layer and style layer set]

Of these 12 generated images, the highest-quality result was produced with content layer 4 and the first five convolutional layers as the style layers.

These will be fixed and used from here on.

Optimizing Style Weights

Setting the coefficient on the content loss to one, we can vary the coefficient on the style loss and observe the quality of the resultant images.
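The sweep itself is straightforward; in the sketch below, `run_style_transfer` is a hypothetical stand-in for the full optimization loop, and only the 100,000 and 10,000,000 settings correspond to weights reported here.

```python
# Style-weight sweep sketch; run_style_transfer is a hypothetical helper,
# and the intermediate weight values are illustrative.
def run_style_transfer(content_img, style_img, style_weight):
    ...  # optimize the input image under content_term + style_weight * style_term

content_img, style_img = ..., ...  # placeholders for preprocessed image tensors
for style_weight in [1e5, 1e6, 1e7, 1e8]:
    result = run_style_transfer(content_img, style_img, style_weight)
```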

[Images generated with each tested style weight]

The image trained with a style weight of 100,000 failed to converge.

Based on image quality, a style weight of 10,000,000 was observed to be optimal.

Therefore, all subsequent images are generated with the style weight set to 10,000,000.

General Results

With the settings optimized earlier, each provided content image was style-transferred using each provided style image.

Content Images

Style Images

Results

Starting from Content vs. Starting from Random Noise

When performing style transfer, the input image can either start from random noise, or the content image itself.

The most obvious benefit of starting from the content image is that fewer modifications need to be made to the input, since the output should have a similar structure to the content image.

However, starting from the content image could result in a strong content signal from the start, which may lead to weak style transfer.
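The two starting points differ only in how the optimized tensor is initialized; a minimal sketch, assuming a preprocessed content image tensor:

```python
import torch

content_img = torch.rand(1, 3, 224, 224)  # placeholder for a preprocessed content image

x_from_noise = torch.rand_like(content_img).requires_grad_(True)  # start from random noise
x_from_content = content_img.clone().requires_grad_(True)         # start from the content image
```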

From this example, there are only minor differences between the generated images.

However, upon very close inspection, it can be seen that the image generated from the content image has finer details than the image generated from random noise.

Furthermore, the run times for the noise-initialized and content-initialized inputs were 335.37 seconds and 336.63 seconds respectively, which are identical for all practical purposes.

More Style Transfers

Style Images

Bells and Whistles

Styling Grumpy Cats

Video Style Transfer

In order to perform style transfer on a video, the video was first decomposed into its component frames.

Each frame was then used as the target content image for style transfer.

The resultant frames were then combined back into a video.
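A sketch of this pipeline using OpenCV is shown below; `stylize` is a hypothetical stand-in for the single-image style transfer described above, and the file names are placeholders.

```python
# Frame-by-frame video style transfer sketch using OpenCV.
import cv2

def stylize(frame):
    return frame  # placeholder for the per-frame style transfer

cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("stylized.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    out.write(stylize(frame))  # style-transfer each frame independently

cap.release()
out.release()
```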

Style
