Goal
- Explore the difference between content loss and style loss
- Perform content reconstruction
- Perform texture synthesis
- Perform style transfer
Reference Papers
- Texture Synthesis Using Convolutional Neural Networks, Gatys et al., 2015
- A Neural Algorithm of Artistic Style, Gatys et al., 2015
Content Loss
Content loss is the squared difference between the feature maps of two images, which can optionally be normalized or averaged. We use a pretrained VGG-19 model to get the feature maps, and we have a choice of which layer \(L_i\) to use as the basis for calculating content loss.
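In the notation of A Neural Algorithm of Artistic Style, with \(F^l\) and \(P^l\) the feature maps of the generated and content images at layer \(l\),

\[
\mathcal{L}_{content} = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - P^l_{ij} \right)^2 .
\]

A minimal PyTorch sketch of this loss (the mean-instead-of-sum reduction is an assumption on my part; it only rescales the gradient):

```python
import torch
import torch.nn.functional as F

def content_loss(gen_features: torch.Tensor, target_features: torch.Tensor) -> torch.Tensor:
    """Squared difference between two (1, C, H, W) feature maps taken from
    the same VGG-19 layer for the generated and the content image."""
    # Gatys et al. use 1/2 * sum of squares; mse_loss averages instead,
    # which only changes the scale of the gradient.
    return F.mse_loss(gen_features, target_features)
```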
Content Reconstruction
☝ content loss layer after conv_1 | ☝ content loss layer after conv_2 |
☝ content loss layer after conv_4 | ☝ content loss layer after conv_8 |
We test the effect of placing the content loss at different locations inside VGG-19 on reconstruction quality. Each convolution layer captures a different level of detail, and so the distance in each feature-map space represents something different. For pure reconstruction it is obviously best to use the original input layer (\(i=0\)), but we may want to base the content loss on a different layer to achieve different styles. I personally like the reconstruction result using a loss layer after conv_4 because it is reminiscent of an HDR image.
☝ Content loss layer based on output from conv_4, but the two images were initialized with two different noise samples |
The difference is indiscernible to the human eye at a distance.
☝ difference between the two images initialized with different noise samples | ☝ difference between one of the reconstructed images and the original image |
We can see that the difference between the two images is just noise. We can also see that the reconstruction results using content loss based on later convolutional layers still manage to retain edges.
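For reference, a minimal sketch of the reconstruction procedure: optimize a noise image until its features match the content image's. The `vgg_features(img, layer)` helper is a hypothetical name for any truncated VGG-19 forward pass, and Adam (rather than L-BFGS) is an assumption:

```python
import torch
import torch.nn.functional as F

def reconstruct(content_img: torch.Tensor, vgg_features, layer: str = "conv_4",
                steps: int = 500, lr: float = 0.05) -> torch.Tensor:
    """Optimize a noise image so its features at `layer` match the content image's."""
    target = vgg_features(content_img, layer).detach()      # fixed target features
    x = torch.randn_like(content_img, requires_grad=True)   # start from noise
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(vgg_features(x, layer), target)
        loss.backward()
        opt.step()
    return x.detach()
```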
Texture Loss
Texture loss is similar to content loss in that we compare feature values of two images. However, whereas content loss compares the values at the same positions of the same feature channels of the same layer, texture loss is perhaps ‘less forgiving’ because it is optimized across entire channels at once. So we compute the Gram matrix

\[
G = X X^\top, \qquad G_{ij} = x_i^\top x_j ,
\]

where each vector \(x_i\) is a vectorized feature map in a layer, and the rows of \(X\) are all the feature maps in that layer.
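A sketch in PyTorch (the division by \(CHW\) is a normalization convention I am assuming; conventions differ between implementations):

```python
import torch
import torch.nn.functional as F

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """Gram matrix of a (1, C, H, W) feature tensor.

    Row i of X is the vectorized feature map x_i, so G[i, j] = x_i . x_j
    measures how strongly channels i and j co-activate, regardless of where.
    """
    _, c, h, w = features.shape
    X = features.view(c, h * w)   # each row: one vectorized feature map
    G = X @ X.t()                 # (C, C) channel correlations
    return G / (c * h * w)        # normalization factor (assumption)

def texture_loss(gen_features: torch.Tensor, style_features: torch.Tensor) -> torch.Tensor:
    """Squared difference between the Gram matrices of two feature maps."""
    return F.mse_loss(gram_matrix(gen_features), gram_matrix(style_features))
```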
Texture Synthesis
Below are textures synthesized from Frida Kahlo’s Self-portrait with Thorn Necklace and Hummingbird (1940).
☝ texture synthesized with texture loss layers after each of conv_1 and conv_2 | ☝ texture synthesized with texture loss layers after each of conv_3 and conv_4 |
☝ texture synthesized with texture loss layers after each of conv_5 through conv_8 | ☝ texture synthesized with texture loss layers after each of conv_9 through conv_12 |
We can see that the earlier we place the texture loss layers, the larger the synthesized detail. For instance, the top row depicts blobs of colour that represent the image, while the bottom row depicts more of the characteristic pointy texture of the thorns and leaves. Texture synthesized with texture loss layers placed after conv_9 through conv_12 was closer to noise than texture, so it was omitted.
☝ Texture loss layers based on output from all conv layers, but the two images were initialized with two different noise samples |
We can clearly see that these two images are different.
The above result may come as a surprise, since I had mentioned that the texture loss is ‘less forgiving’, yet the two images turned out quite different. Unlike image reconstruction from content loss, it is impossible to force all features at the texture loss layers to match, because the Gram matrix discards spatial arrangement: many differently arranged images share the same channel correlations.
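A quick way to see this, reusing the gram_matrix sketch from above: shuffling the spatial positions of a feature map leaves its Gram matrix unchanged, since \((XP)(XP)^\top = XX^\top\) for any permutation \(P\).

```python
import torch

# Toy (1, C, H, W) features; spatially permuting them changes the "image"
# completely but leaves the channel correlations, and hence the Gram
# matrix, untouched.
feat = torch.randn(1, 8, 4, 4)
perm = torch.randperm(4 * 4)
shuffled = feat.view(1, 8, 16)[:, :, perm].view(1, 8, 4, 4)
print(torch.allclose(gram_matrix(feat), gram_matrix(shuffled)))  # True
```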
Style Transfer
Now we put both content loss and texture loss into the equation.
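Following A Neural Algorithm of Artistic Style, the combined objective is a weighted sum,

\[
\mathcal{L}_{total} = \alpha \, \mathcal{L}_{content} + \beta \, \mathcal{L}_{style},
\]

where \(\alpha\) and \(\beta\) correspond to the content_weight and style_weight below.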
The hyperparameters that often gave me good results were:
content_layers="conv_9"
style_layers="conv_2,conv_4,conv_8,conv_16"
content_weight=1
style_weight=0.5
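Putting the pieces together, a minimal sketch of the transfer loop under those settings. The vgg_features helper and gram_matrix are the hypothetical/sketched functions from the earlier sections; L-BFGS follows the Gatys papers, though Adam also works:

```python
import torch
import torch.nn.functional as F

def style_transfer(content_img, style_img, vgg_features,
                   content_layers=("conv_9",),
                   style_layers=("conv_2", "conv_4", "conv_8", "conv_16"),
                   content_weight=1.0, style_weight=0.5, steps=50):
    """Minimize content_weight * L_content + style_weight * L_style over the image."""
    content_targets = {l: vgg_features(content_img, l).detach() for l in content_layers}
    style_targets = {l: gram_matrix(vgg_features(style_img, l)).detach()
                     for l in style_layers}
    x = content_img.clone().requires_grad_(True)   # image prior; noise also works
    opt = torch.optim.LBFGS([x])

    def closure():
        opt.zero_grad()
        loss = content_weight * sum(F.mse_loss(vgg_features(x, l), content_targets[l])
                                    for l in content_layers)
        loss = loss + style_weight * sum(
            F.mse_loss(gram_matrix(vgg_features(x, l)), style_targets[l])
            for l in style_layers)
        loss.backward()
        return loss

    for _ in range(steps):
        opt.step(closure)   # each LBFGS step runs several inner iterations
    return x.detach()
```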
My workflow for finding the best output is as follows:
- Choose two images that have similar themes, e.g. buildings, portraits, grungy, dark background & light foreground, etc.
- Tune the texture loss layer locations by synthesizing textures
- Tune the style weight and number of steps (I found these to have less impact on overall image quality)
Results
On my GPU, there is not much difference in runtime whether we run the algorithm from an image prior or from randomly sampled noise.
☝ about 100 seconds | ☝ about 100 seconds |
Results using dataset:
Results using some of my own photos: