In this project, we will again employ deep neural networks to assist with our image synthesis task. However, instead of optimizing the parameters of the neural network, we will be optimizing the pixels of our synthesized image!
Our goal is to synthesize a new image that combines the content of one image with the style of another. To do this, we will use a version of VGG-19 that has been pre-trained on the ImageNet classification task. The thought here is that the feature-extraction (or convolutional) layers must have learned how to extract meaningful information from the source image in order to accurately solve the classification task.
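As a minimal sketch (variable names are ours), the pre-trained convolutional layers can be loaded from torchvision and frozen, since only the pixels of the target image will be optimized:

```python
import torch
import torchvision.models as models

# Pre-trained VGG-19 feature-extraction (convolutional) layers.
# The weights stay frozen; we only optimize the pixels of the target image.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
vgg_features = models.vgg19(pretrained=True).features.to(device).eval()
for param in vgg_features.parameters():
    param.requires_grad_(False)
```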
We will pass our target image through the pretrained network along with our style and content images and compute a content and style loss by comparing the results after specific layers. We then synthesize our target image by optimizing this content and style loss.
Content Reconstruction
For the content loss, we will simply take the L2-distance between the target and content images after a specific layer of the network. If we denote the layer-$L$ features of the target image $X$ as $F^L_X$ and those of the content image $C$ as $F^L_C$, then the content loss at layer $L$ is defined as $\mathcal{L}_{content}^{(L)} = \lVert F^L_X - F^L_C \rVert_2^2$.
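A minimal sketch of this loss at one layer (the function name is ours; mean-squared error equals the squared L2 distance up to a constant factor):

```python
import torch.nn.functional as F

def content_loss(target_feat, content_feat):
    # Squared L2 distance between the target and content features at one layer
    # (implemented as MSE, i.e. the squared L2 distance up to a constant factor).
    return F.mse_loss(target_feat, content_feat)
```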
Since we are only optimizing the pixels of an image instead of the millions of parameters of a deep neural network, we can afford to use a second-order optimization method. This allows us to consider the second derivative (Hessian) in addition to the first derivative (gradient), which leads to more accurate optimization steps. For this work, we use the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) algorithm as our optimizer.
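In PyTorch, LBFGS re-evaluates the loss through a closure. A minimal sketch of the pixel-optimization loop, assuming `content_img` is already loaded and `compute_total_loss` is a helper that sums the content and style losses:

```python
import torch

# Only the target image's pixels require gradients.
target = torch.rand_like(content_img).requires_grad_(True)  # random-noise initialization
optimizer = torch.optim.LBFGS([target])

def closure():
    optimizer.zero_grad()
    loss = compute_total_loss(target)  # assumed helper combining content and style losses
    loss.backward()
    return loss

for step in range(300):  # the number of LBFGS steps is a free choice
    optimizer.step(closure)
```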
Consider the following content image:
If we initialize a target image as random noise, we can observe how results vary as we choose to optimize only the content at progressively deeper layers of the network:
We can see that reconstructing the original image becomes increasingly difficult as we move deeper in the network. This is because deep neural networks become increasingly less invertible as more non-linearities are added with each layer. Specifically, it seems like Conv_4 is the last layer before the reconstruction really falls off, so we will use that layer in our final style transfer.
It is interesting to note, however, that if we initialized the target image as the content image itself, we would see perfect reconstruction at each layer, because the reconstruction loss would be 0, leading to zero gradient (and Hessian) and no updates to the pixels.
Included below are content reconstructions of two more images at specific layers:
Again, Conv_4 has decent reconstruction. As we move deeper (Conv_9), we start to see a lot of distortion and an almost random texture.
Texture Synthesis
To conceptualize the style of an image, we will analyze the Gram matrix of its features at each layer. For a set of vectors (in this case, our vectorized feature maps), the Gram matrix is a symmetric matrix made up of the pair-wise inner products of all vectors in the set. In other words, if we reshape our features at a given layer to be of shape $N \times M$, where $N$ is the number of feature maps and $M$ is the number of spatial positions, then the Gram matrix is simply $G = F F^T$.
Similarly to the content loss, our style loss at layer $L$ is then $\mathcal{L}_{style}^{(L)} = \lVert G^L_X - G^L_S \rVert_2^2$, where $G^L_X$ and $G^L_S$ are the Gram matrices of the target and style features at that layer.
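As a concrete sketch (function names and the per-entry normalization constant are our choices), the Gram matrix and style loss at one layer could be written as:

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    # feat: (batch, channels, height, width) -> reshape to (N, M) and take
    # pair-wise inner products; normalizing by the number of entries keeps
    # the scale comparable across layers.
    n, c, h, w = feat.shape
    f = feat.view(n * c, h * w)
    return (f @ f.t()) / (n * c * h * w)

def style_loss(target_feat, style_feat):
    # Squared L2 distance between the Gram matrices of target and style features.
    return F.mse_loss(gram_matrix(target_feat), gram_matrix(style_feat))
```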
Consider the following style image:
Like in the previous section, if we initialize a target image as random noise, we can observe how the results vary as we choose to optimize only the style at progressively deeper layers of the network:
As we optimize the style at deeper layers of the network, the texture of the resulting image contains increasingly high-frequency detail. Optimizing with the first five layers seems to give us the best-looking texture, so we will stick with that for our final style transfer.
Included below are texture syntheses from two more images:
The Scream – Edvard Munch | Extracted Texture
We see that the extracted textures match the artists’ styles decently well without retaining any of the content (other than the colors).
Style Transfer
When optimizing both the content and style losses, we quickly find that the two need to be weighted differently to achieve good results. In this implementation, we normalized our Gram matrices by the number of hidden units and weighted the style loss by a factor of 10⁶ more than the content loss.
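Concretely, the combined objective could look like the following (variable names are ours; the 10⁶ ratio is the one mentioned above):

```python
# Style loss weighted 1e6 times more than content loss.
style_weight, content_weight = 1e6, 1.0
total_loss = content_weight * total_content_loss + style_weight * total_style_loss
```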
For the pre-trained network, we created a new network by copying layers from the downloadable version of VGG-19 included with PyTorch and injecting our content and style loss modules after the appropriate layers.
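A rough sketch of this assembly, reusing `vgg_features`, `content_img`, and `style_img` from the earlier sketches, and assuming `ContentLoss` and `StyleLoss` are small `nn.Module` wrappers that store a reference feature, record their loss in `forward()`, and pass their input through unchanged (the layer choices below are illustrative):

```python
import copy
import torch.nn as nn

content_layers = {"conv_4"}
style_layers = {"conv_1", "conv_2", "conv_3", "conv_4", "conv_5"}

model = nn.Sequential()
i = 0
for layer in copy.deepcopy(vgg_features).children():
    if isinstance(layer, nn.Conv2d):
        i += 1
        name = f"conv_{i}"
    elif isinstance(layer, nn.ReLU):
        name = f"relu_{i}"
        layer = nn.ReLU(inplace=False)  # in-place ReLU would clobber stored features
    else:
        name = f"pool_{i}"
    model.add_module(name, layer)
    if name in content_layers:
        # Feature of the content image at this depth becomes the reference.
        model.add_module(f"content_loss_{i}", ContentLoss(model(content_img).detach()))
    if name in style_layers:
        # Feature of the style image at this depth becomes the reference.
        model.add_module(f"style_loss_{i}", StyleLoss(model(style_img).detach()))
```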
Below is a grid (best viewed on desktop) of images created using our neural style transfer technique with content images on the top row and style images on the left column:
Content images (top row): Dancing; Falling Water. Style images (left column): Self-Portrait with Thorn Necklace and Hummingbird – Frida Kahlo; Starry Night – Vincent van Gogh.
Here is where we see the full effects of optimizing both style and content loss. We have taken the content of one image and seen it re-visualized in the style of another.
One other thing to consider when performing neural style transfer is whether to initialize the target image as a copy of the content image or as random noise.
Content | Style | Content Initialization | Random Initialization
Initializing with the content image starts the content loss at its minimum, so it takes fewer iterations of our optimizer to generate the final output. However, this also causes the process to somewhat overfit to the content image. We see in the random initialization that the texture on the dancer is much more varied and uses multiple colors, whereas in the content initialization the dancer is mostly a solid color with a smooth texture, as in the content image.
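The two initializations differ by a single line in the setup sketched earlier (`content_img` assumed already loaded):

```python
# Content initialization: start from a copy of the content image.
target = content_img.clone().requires_grad_(True)
# Random initialization: start from noise instead.
# target = torch.rand_like(content_img).requires_grad_(True)
```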
Results on Other Images
Here, we include results on other images gathered from the internet and some from our previous projects.
First, here are the content images used to generate the images below:
We will now see the style image followed by the results of neural style transfer using the previously shown content images.
Thanks for reading