In this assignment, we implement neural style transfer. The algorithm takes in a content image, a style image, and an input image. The input image is optimized so that its content matches the content image and its style matches the style image. In the first two parts, we focus on content reconstruction and texture synthesis separately. In the last part, we combine them for neural style transfer.
We use a content loss to reconstruct the content of the content image. Specifically, the content loss measures the distance between the feature maps of the content image and the input image at one or more layers of the same neural network.
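As a minimal sketch, the content loss can be written as a mean squared error between two feature maps. This NumPy version is an illustration only; in the actual assignment the features would be extracted from layers of VGG-19.

```python
import numpy as np

def content_loss(input_feat: np.ndarray, target_feat: np.ndarray) -> float:
    """Mean squared error between two feature maps of identical shape.

    input_feat, target_feat: (C, H, W) activations from the same layer,
    one from the optimized input image and one from the content image.
    """
    return float(np.mean((input_feat - target_feat) ** 2))
```

During optimization, the gradient of this loss with respect to the input image (obtained by backpropagating through the network) drives the input's features toward the content image's features.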
Below are the results of optimizing the content loss over different layers. These layers were chosen because each is the last layer of a convolutional block of VGG-19.
We can see from the results above that the content loss over the shallow layers produces better reconstructions of the content. Below are two reconstructions using the content loss over the second convolutional layer, starting from two different noise images, together with their difference. For most pixels the two reconstructions are identical, but many pixels differ in intensity in the R, G, or B channel.
We measure the distance between the styles of two images using the Gram matrix. The Gram matrix of a feature map captures the correlations between feature channels: each entry is the inner product of two channels' flattened activations. Below we investigate optimizing the texture loss over different layers by synthesizing texture images that simulate the style of Frida Kahlo. The results below show that using shallow layers produces textures more similar to the style of the original style image.
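A minimal sketch of the Gram matrix and the resulting style loss, in NumPy. The (C, H, W) feature layout and the normalization constant are assumptions for illustration; implementations vary in how they normalize.

```python
import numpy as np

def gram_matrix(feat: np.ndarray) -> np.ndarray:
    """Channel-by-channel correlation of a (C, H, W) feature map."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)       # flatten the spatial dimensions
    return (f @ f.T) / (c * h * w)   # (C, C), normalized by element count

def style_loss(input_feat: np.ndarray, target_feat: np.ndarray) -> float:
    """MSE between Gram matrices of two feature maps.

    Because the Gram matrix discards spatial arrangement, this compares
    texture statistics rather than content.
    """
    g_in, g_tgt = gram_matrix(input_feat), gram_matrix(target_feat)
    return float(np.mean((g_in - g_tgt) ** 2))
```

The key design point is that the Gram matrix throws away where features occur and keeps only which features co-occur, which is why minimizing this loss reproduces texture without copying layout.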
Below are the texture syntheses from two different noise images and their difference. The difference between the two texture syntheses is much larger than the difference between the two content reconstructions, because the syntheses are only constrained to share the same texture style, not the same content.
We tune the hyperparameters and use the following settings for style transfer, where the loss consists of both the content loss and the style loss. We use the content loss over the second convolutional layer and the style loss over the first five convolutional layers. We set the style loss weight to \( 1 \times 10^4 \) and keep the content loss weight at 1. Below are the results.
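The combined objective described above can be sketched as a weighted sum. This self-contained NumPy illustration takes per-layer feature lists as input; the layer indices and weights follow the text, while the feature-extraction step (VGG-19 forward pass) is elided.

```python
import numpy as np

def gram(feat: np.ndarray) -> np.ndarray:
    """Normalized Gram matrix of a (C, H, W) feature map."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return (f @ f.T) / (c * h * w)

def transfer_loss(input_feats, content_feats, style_feats,
                  content_layers=(1,),            # second conv layer
                  style_layers=(0, 1, 2, 3, 4),   # first five conv layers
                  w_content=1.0, w_style=1e4) -> float:
    """Weighted sum of content and style losses.

    Each argument is a list of per-layer (C, H, W) feature maps for the
    input, content, and style images respectively (hypothetical layout).
    """
    l_content = sum(np.mean((input_feats[i] - content_feats[i]) ** 2)
                    for i in content_layers)
    l_style = sum(np.mean((gram(input_feats[i]) - gram(style_feats[i])) ** 2)
                  for i in style_layers)
    return float(w_content * l_content + w_style * l_style)
```

The large style weight compensates for the small magnitude of the Gram-matrix differences relative to the raw feature differences, so neither term dominates the optimization.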
We also explore the difference between using a noise image as the input and using a clone of the content image as the input. Below are the results with Fallingwater as the content and Frida Kahlo as the style, together with their differences. Optimization takes 45.7 seconds with noise as the input and 47.3 seconds with a clone of the content image. As for the output images, apart from local brightness differences, the two outputs are similar to each other.
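The two initializations compared above can be sketched as follows. The function name and shapes are hypothetical; the content image would come from the assignment's data loader.

```python
import numpy as np

def init_input(content_img: np.ndarray, mode: str = "noise",
               seed: int = 0) -> np.ndarray:
    """Return the starting point for optimization.

    mode="noise":   uniform random noise with the content image's shape.
    mode="content": an independent copy of the content image itself.
    """
    if mode == "noise":
        return np.random.default_rng(seed).random(content_img.shape)
    return content_img.copy()
```

Starting from a clone of the content image gives the optimizer a starting point that already has zero content loss, which explains why the two runs converge to similar outputs in comparable time.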
Below are some results of stylizing content images that I took myself with style images from the internet.
We stylize the cat images that we used in homework 3.