Master of Science in Machine Learning, 2023
This project identifies the layers of a pretrained VGG19 model that best represent the style and content of an image, and uses them to perform style transfer: an input image is optimized to minimize its loss against a content image and a style image, with each loss computed at the identified layers.
To reconstruct an image from pure noise, the noise input and the content image are both run through the model up to the chosen content layer, where the loss between their feature maps is computed.
The image is then reconstructed by gradient descent on this loss with respect to the pixels of the input image.
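The procedure above can be illustrated with a minimal NumPy sketch. It is not the project's implementation: the frozen VGG19 layers are replaced by a hypothetical fixed random linear map (`features`), and the toy "image" is a flattened 16-value vector, so that the gradient with respect to the pixels can be written by hand. Only the pixels are updated; the stand-in layer stays fixed.

```python
import numpy as np

def reconstruct(content, features, steps=300, seed=0):
    """Gradient descent on the pixels of a noise image to match content features.

    `features` is a fixed matrix standing in for the frozen network up to the
    content layer (an assumption for this sketch, not an actual VGG19 layer).
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=content.shape)  # start from uniform noise
    target = features @ content                     # content features, fixed
    # Step size 1/L, where L is the Lipschitz constant of the loss gradient.
    lr = target.size / (2.0 * np.linalg.norm(features, 2) ** 2)
    losses = []
    for _ in range(steps):
        diff = features @ x - target
        losses.append(float(np.mean(diff ** 2)))    # MSE in feature space
        grad = 2.0 * features.T @ diff / diff.size  # d(loss) / d(pixels)
        x -= lr * grad                              # update the pixels only
    return x, losses

rng = np.random.default_rng(42)
content = rng.uniform(0.0, 1.0, size=16)  # flattened toy "image"
features = rng.normal(size=(64, 16))      # hypothetical frozen linear layer

# Two different noise initializations optimized towards the same content.
recon_a, losses_a = reconstruct(content, features, seed=1)
recon_b, losses_b = reconstruct(content, features, seed=2)
```

In this linear toy setting both runs recover the content vector almost exactly. With a deep nonlinear network the loss is not convex, so reconstructions from different noise inputs should only be expected to be similar, not identical.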
Each Conv2D layer was independently tested and the results of each reconstruction attempt are shown below.
The best results are achieved when the content reconstruction layer is placed nearer to the input (early layers).
Therefore, when identifying layers for minimizing content loss, these layers should be considered first.
Based on the above results, one strong candidate layer for optimizing the reconstruction loss is the first convolutional layer (my personal favourite).
Therefore, two random noise inputs were initialized from a uniform distribution, and optimized to the same content image using the first layer.
Despite being initialized from different noise images, the resultant images were relatively similar to each other. They were also close to the original image fed into the network.
Next, image textures are synthesized via a similar method. However, instead of directly using the reconstruction loss between two images at a particular layer, the Gram matrix of the feature maps is first computed, which captures the correlation between each pair of feature channels while discarding their spatial layout.
At each identified layer, the feature maps were converted into Gram matrices, and the MSE loss was computed between the input and style images.
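As a hedged illustration of this step, the sketch below computes a Gram matrix and the MSE style loss with NumPy. The feature maps are random arrays standing in for real layer activations, the function names are hypothetical, and dividing by the number of elements is one common normalization convention rather than necessarily the one used in this project.

```python
import numpy as np

def gram_matrix(feature_maps):
    """Gram matrix of feature maps shaped (channels, height, width)."""
    c, h, w = feature_maps.shape
    flat = feature_maps.reshape(c, h * w)  # one row per channel
    return flat @ flat.T / (c * h * w)     # normalization is an assumed convention

def style_loss(input_maps, style_maps):
    """MSE between the Gram matrices of the input and style feature maps."""
    diff = gram_matrix(input_maps) - gram_matrix(style_maps)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
style_maps = rng.normal(size=(4, 8, 8))  # hypothetical layer activations
input_maps = rng.normal(size=(4, 8, 8))

# Shuffling spatial positions identically in every channel leaves the
# Gram matrix unchanged: it records channel correlations, not layout.
perm = rng.permutation(8 * 8)
shuffled = style_maps.reshape(4, -1)[:, perm].reshape(4, 8, 8)
```

The shuffled feature maps give the same Gram matrix as the originals, which is why this loss matches texture statistics rather than the arrangement of objects in the style image.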
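The candidate layer sets considered for texture synthesis were the first five convolutional layers, the first layer in each block, and the last layer in each block.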
While all produced images capture the essence of the style in the image, the images produced when considering the first or last layer from each block retained more content structure.
To preserve only the style of the style image, the best set of layers for style loss minimization is the first five convolutional layers (1, 2, 3, 4 and 5).
Using the above results, we repeat the same experiment using two different random noise initializations.
Two random noise inputs were initialized from a uniform distribution, and optimized to the same style image using the first five layers.
Both generated images were roughly representative of the style of the input image. However, the differences between both generated images are more apparent.
This shows that the synthesized style is more sensitive to the noise initialization of the input image than the reconstructed content is.
With the ability to reconstruct images and synthesize textures, a natural next step is to perform style transfer onto an image.
Since the input image interacts with the style and content images mainly through the losses minimized at the identified layers, the choice of layers is very important.
Heuristically, earlier layers reproduce content better in the resultant image, while the first five layers, the first layer in each block, and the last layer in each block are all good candidate sets for injecting style.
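The three candidate style sets can be written out as layer indices. The sketch below assumes the convolutional layers are numbered from 1 in network order, which matches the "1, 2, 3, 4 and 5" numbering used earlier but is not confirmed by the text; the function and key names are hypothetical. VGG19 has 16 convolutional layers arranged in five blocks of 2, 2, 4, 4 and 4.

```python
# Number of convolutional layers in each of the five VGG19 blocks.
VGG19_BLOCK_SIZES = [2, 2, 4, 4, 4]

def candidate_style_layers(block_sizes=VGG19_BLOCK_SIZES):
    """Return the three candidate sets as 1-based conv-layer indices."""
    first_in_block, last_in_block = [], []
    start = 1  # assumed 1-based numbering of conv layers in network order
    for size in block_sizes:
        first_in_block.append(start)
        last_in_block.append(start + size - 1)
        start += size
    return {
        "first_five": list(range(1, 6)),
        "first_in_block": first_in_block,
        "last_in_block": last_in_block,
    }

layer_sets = candidate_style_layers()
```

Under this numbering, the "first five" set stops inside the third block, whereas the other two sets sample every block, including the deepest layers.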
To find out the effect of choosing different style and content layers, a content and a style image were chosen for style transfer.
Of these 12 generated images, the highest-quality one was generated with content layer 4 and the first five convolutional layers as the style layers.
These settings are fixed and used from here on.
Setting the coefficient on the content loss to one, we can vary the coefficient on the style loss and observe the quality of the resultant images.
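A minimal sketch of how the two coefficients could combine the losses is given below. The function name and the example loss values are hypothetical, and summing the per-layer style losses (rather than averaging them) is an assumption.

```python
def total_loss(content_loss, style_losses, style_weight, content_weight=1.0):
    """Weighted sum of the content loss and the per-layer style losses."""
    return content_weight * content_loss + style_weight * sum(style_losses)

# Hypothetical loss values, for illustration only.
content_loss = 0.5
style_losses = [1e-7, 2e-7, 3e-7, 1e-7, 3e-7]  # one entry per style layer

# Content weight stays at 1 while the style weight is varied.
for style_weight in (100_000, 10_000_000):
    print(style_weight, total_loss(content_loss, style_losses, style_weight))
```

Because only the ratio between the two coefficients affects which image minimizes the total, fixing the content weight at one and sweeping the style weight covers the relevant trade-off.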
The image generated with a style weight of 100,000 failed to converge.
Based on the image quality, the setting of 10,000,000 was observed to be optimal.
Therefore, all subsequent images are generated with the style weight set to 10,000,000.
With the settings optimized earlier, style transfer was performed for every pairing of the provided content and style images.
When performing style transfer, the input image can either start from random noise, or the content image itself.
The most obvious benefit of starting from the content image is that fewer modifications have to be made to the input image, since the output should have a structure similar to that of the content image.
However, starting from the content image could result in a strong content signal from the start, which may lead to weak style transfer.
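The two starting points can be sketched as follows. This is an illustrative helper with a hypothetical name, assuming images are NumPy arrays with pixel values in [0, 1] and that the noise is drawn from a uniform distribution as in the earlier experiments.

```python
import numpy as np

def init_input(content_image, mode="noise", seed=0):
    """Return the starting image for optimization.

    mode="content" starts from a copy of the content image;
    mode="noise" starts from uniform noise of the same shape.
    """
    if mode == "content":
        # Copy so that optimization does not overwrite the content target.
        return content_image.copy()
    if mode == "noise":
        rng = np.random.default_rng(seed)
        return rng.uniform(0.0, 1.0, size=content_image.shape)
    raise ValueError(f"unknown mode: {mode!r}")

content_image = np.full((3, 4, 4), 0.5)  # hypothetical (channels, H, W) image
from_content = init_input(content_image, mode="content")
from_noise = init_input(content_image, mode="noise")
```

Everything else in the optimization stays the same between the two cases, so any difference in the output can be attributed to the starting point.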
From this example, there are only minor differences between the generated images.
However, upon very close inspection, it can be seen that the image generated from the content image has finer details than the image generated from random noise.
Furthermore, the run times for the noise-initialized and content-initialized inputs were 335.37 seconds and 336.63 seconds respectively, which are identical for all practical purposes.
In order to perform style transfer on a video, the video was first decomposed into its component frames.
Each frame was then used as the target content image for style transfer.
The resultant frames were then combined back into a video.
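The three steps above amount to mapping a per-frame function over the video. The sketch below shows only that structure: `stylize_video` and `stylize_frame` are hypothetical names, the frames are plain nested lists, and reading and writing the video file is left to whichever video library is used.

```python
def stylize_video(frames, stylize_frame):
    """Apply style transfer to each frame independently, preserving order."""
    return [stylize_frame(frame) for frame in frames]

def invert(frame):
    """Toy stand-in for style transfer: invert pixel values in [0, 1]."""
    return [[1.0 - pixel for pixel in row] for row in frame]

# Two tiny 2x2 grayscale "frames", for illustration only.
frames = [
    [[0.0, 0.25], [0.5, 1.0]],
    [[1.0, 0.75], [0.5, 0.0]],
]
stylized = stylize_video(frames, invert)
```

Each frame is processed independently in this scheme, so nothing ties the result for one frame to the result for its neighbours.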