In this project, we will again employ deep neural networks to assist with our image synthesis task. However, instead of optimizing the parameters of the neural network, we will be optimizing the pixels of our synthesized image!
Our goal is to synthesize a new image that combines the content of one image with the style of another. To do this, we will use a version of VGG-19 that has been pre-trained on the ImageNet classification task. The thought here is that the feature-extraction (or convolutional) layers must have learned how to extract meaningful information from the source image in order to accurately solve the classification task.
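As a minimal sketch (variable names are ours), the pre-trained convolutional layers can be loaded from torchvision and frozen, since only the pixels of the target image will be optimized:

```python
import torch
import torchvision.models as models

# Pre-trained VGG-19 feature-extraction (convolutional) layers.
# The weights stay frozen; we only optimize the pixels of the target image.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
vgg_features = models.vgg19(pretrained=True).features.to(device).eval()
for param in vgg_features.parameters():
    param.requires_grad_(False)
```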
We will pass our target image through the pretrained network along with our style and content images and compute a content and style loss by comparing the results after specific layers. We then synthesize our target image by optimizing this content and style loss.
Content Reconstruction
For the content loss, we will simply take the L2-distance between the target and content images after a specific layer of the network. If we denote the layer-$L$ features of the target image $X$ as $F^L_X$ and those of the content image $C$ as $F^L_C$, then the content loss at layer $L$ is defined as $\mathcal{L}_{content}^{(L)} = \lVert F^L_X - F^L_C \rVert_2^2$.
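A minimal sketch of this loss at one layer (the function name is ours; mean-squared error equals the squared L2 distance up to a constant factor):

```python
import torch.nn.functional as F

def content_loss(target_feat, content_feat):
    # Squared L2 distance between the target and content features at one layer
    # (implemented as MSE, i.e. the squared L2 distance up to a constant factor).
    return F.mse_loss(target_feat, content_feat)
```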
Since we are only optimizing the pixels of an image instead of the millions of parameters of a deep neural network, we can afford to use a second-order optimization method. This allows us to consider the second derivative (Hessian) in addition to the first derivative (gradient), which leads to more accurate optimization steps. For this work, we use the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) algorithm as our optimizer.
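In PyTorch, LBFGS re-evaluates the loss through a closure. A minimal sketch of the pixel-optimization loop, assuming `content_img` is already loaded and `compute_total_loss` is a helper that sums the content and style losses:

```python
import torch

# Only the target image's pixels require gradients.
target = torch.rand_like(content_img).requires_grad_(True)  # random-noise initialization
optimizer = torch.optim.LBFGS([target])

def closure():
    optimizer.zero_grad()
    loss = compute_total_loss(target)  # assumed helper combining content and style losses
    loss.backward()
    return loss

for step in range(300):  # the number of LBFGS steps is a free choice
    optimizer.step(closure)
```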
Consider the following content image:
If we initialize a target image as random noise, we can observe how results vary as we choose to optimize only the content at progressively deeper layers of the network:
We can see that reconstructing the original image becomes increasingly difficult as we move deeper in the network. This is because deep neural networks become increasingly less invertible as more non-linearities are added with each layer. Specifically, it seems like Conv_4 is the last layer before the reconstruction really falls off, so we will use that layer in our final style transfer.
It is interesting to note, however, that if we initialized the target image as the content image itself, we would see perfect reconstruction at each layer, because the reconstruction loss would be 0, leading to zero gradient (and Hessian) and no updates to the pixels.
Included below are content reconstructions of two more images at specific layers:
Again, Conv_4 has decent reconstruction. As we move deeper (Conv_9), we start to see a lot of distortion and an almost random texture.
Texture Synthesis
To conceptualize the style of an image, we will analyze the Gram matrix of its features at each layer. For a set of vectors (in this case, our vectorized feature maps), the Gram matrix is a symmetric matrix made up of the pair-wise inner products of all vectors in the set. In other words, if we reshape our features at a given layer to be of shape $N \times M$, where $N$ is the number of feature maps and $M$ is the number of spatial positions, then the Gram matrix is simply $G = F F^T$.
Similarly to the content loss, our style loss at layer $L$ is then $\mathcal{L}_{style}^{(L)} = \lVert G^L_X - G^L_S \rVert_2^2$, where $G^L_X$ and $G^L_S$ are the Gram matrices of the target and style features at that layer.
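As a concrete sketch (function names and the per-entry normalization constant are our choices), the Gram matrix and style loss at one layer could be written as:

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    # feat: (batch, channels, height, width) -> reshape to (N, M) and take
    # pair-wise inner products; normalizing by the number of entries keeps
    # the scale comparable across layers.
    n, c, h, w = feat.shape
    f = feat.view(n * c, h * w)
    return (f @ f.t()) / (n * c * h * w)

def style_loss(target_feat, style_feat):
    # Squared L2 distance between the Gram matrices of target and style features.
    return F.mse_loss(gram_matrix(target_feat), gram_matrix(style_feat))
```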
Consider the following style image:
Like in the previous section, if we initialize a target image as random noise, we can observe how the results vary as we choose to optimize only the style at progressively deeper layers of the network:
As we optimize the style at deeper layers of the network, the texture of the resulting image contains increasingly high-frequency detail. Optimizing with the first five layers seems to give us the best-looking texture, so we will stick with that for our final style transfer.
Included below are texture syntheses from two more images:
The Scream – Edvard Munch | Extracted Texture
We see that the extracted textures match the artists’ styles decently well without retaining any of the content (other than the colors).
Style Transfer
When optimizing both the content and style losses, we quickly find that the two need to be weighted differently to achieve good results. In this implementation, we normalized our Gram matrices by the number of hidden units and weighted the style loss by a factor of 10⁶ more than the content loss.
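Concretely, the combined objective could look like the following (variable names are ours; the 10⁶ ratio is the one mentioned above):

```python
# Style loss weighted 1e6 times more than content loss.
style_weight, content_weight = 1e6, 1.0
total_loss = content_weight * total_content_loss + style_weight * total_style_loss
```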
For the pre-trained network, we created a new network by copying layers from the downloadable version of VGG-19 included with PyTorch and injecting our content and style loss modules after the appropriate layers.
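A rough sketch of this assembly, reusing `vgg_features`, `content_img`, and `style_img` from the earlier sketches, and assuming `ContentLoss` and `StyleLoss` are small `nn.Module` wrappers that store a reference feature, record their loss in `forward()`, and pass their input through unchanged (the layer choices below are illustrative):

```python
import copy
import torch.nn as nn

content_layers = {"conv_4"}
style_layers = {"conv_1", "conv_2", "conv_3", "conv_4", "conv_5"}

model = nn.Sequential()
i = 0
for layer in copy.deepcopy(vgg_features).children():
    if isinstance(layer, nn.Conv2d):
        i += 1
        name = f"conv_{i}"
    elif isinstance(layer, nn.ReLU):
        name = f"relu_{i}"
        layer = nn.ReLU(inplace=False)  # in-place ReLU would clobber stored features
    else:
        name = f"pool_{i}"
    model.add_module(name, layer)
    if name in content_layers:
        # Feature of the content image at this depth becomes the reference.
        model.add_module(f"content_loss_{i}", ContentLoss(model(content_img).detach()))
    if name in style_layers:
        # Feature of the style image at this depth becomes the reference.
        model.add_module(f"style_loss_{i}", StyleLoss(model(style_img).detach()))
```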
Below is a grid (best viewed on desktop) of images created using our neural style transfer technique with content images on the top row and style images on the left column:
Content images (top row): Dancing; Falling Water. Style images (left column): Self-Portrait with Thorn Necklace and Hummingbird – Frida Kahlo; Starry Night – Vincent van Gogh.
Here is where we see the full effects of optimizing both style and content loss. We have taken the content of one image and seen it re-visualized in the style of another.
One other thing to consider when performing neural style transfer is whether to initialize the target image as a copy of the content image or as random noise.
Content | Style | Content Initialization | Random Initialization
Initializing with the content image starts the content loss at its minimum, so it takes fewer iterations of our optimizer to generate the final output. However, this also causes the process to somewhat overfit to the content image. We see in the random initialization that the texture on the dancer is much more varied and uses multiple colors, whereas in the content initialization the dancer is mostly a solid color with a smooth texture, as in the content image.
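The two initializations differ by a single line in the setup sketched earlier (`content_img` assumed already loaded):

```python
# Content initialization: start from a copy of the content image.
target = content_img.clone().requires_grad_(True)
# Random initialization: start from noise instead.
# target = torch.rand_like(content_img).requires_grad_(True)
```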
Results on Other Images
Here, we include results on other images gathered from the internet and some from our previous projects.
First, here are the content images used to generate the images below:
We will now see the style image followed by the results of neural style transfer using the previously shown content images.
Thanks for reading