Goal

  1. Explore the difference between content loss and style loss
  2. Perform content reconstruction
  3. Perform texture synthesis
  4. Perform style transfer

Reference Papers

Content Loss

Content loss is the squared difference between the feature maps of two images, optionally normalized or averaged. We use a pretrained VGG-19 model to extract the feature maps, and we have a choice of which layer \(L_i\) to use as the basis for the content loss.
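As a rough PyTorch sketch (the truncation point `vgg[:21]` in the comment is a hypothetical choice of layer, not a recommendation):

```python
import torch
import torch.nn.functional as F

def content_loss(feat_x, feat_c):
    """Squared difference between two feature maps, averaged over all entries."""
    return F.mse_loss(feat_x, feat_c)

# Hypothetical usage with torchvision's pretrained VGG-19:
#   from torchvision.models import vgg19
#   vgg = vgg19(weights="IMAGENET1K_V1").features.eval()
#   extractor = vgg[:21]   # everything up to the chosen layer L_i
#   loss = content_loss(extractor(x), extractor(content_img).detach())
```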

Content Reconstruction

   
☝ Content reconstructions with the content loss layer placed after conv_1, conv_2, conv_4, and conv_8, respectively.


We test the effect of placing the content loss at different locations inside VGG-19 on reconstruction quality. Each convolutional layer captures a different level of detail, so distances in each feature-map space represent different things. For pure reconstruction it is obviously best to use the original input layer (\(i=0\)), but we may want to base the content loss on a deeper layer to achieve different styles. I personally like the reconstruction result using a loss layer after conv_4 because it is reminiscent of an HDR image.
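The reconstruction itself is just gradient descent on a noise image. A minimal sketch, where `extractor` stands in for a pretrained VGG-19 truncated after the chosen conv layer (any module mapping an image to a feature map works):

```python
import torch
import torch.nn.functional as F

def reconstruct(extractor, content_img, steps=300, lr=0.1):
    """Optimize an image, starting from noise, so that its features under
    `extractor` match those of content_img."""
    target = extractor(content_img).detach()
    x = torch.randn_like(content_img).requires_grad_(True)  # start from noise
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(extractor(x), target)
        loss.backward()
        opt.step()
    return x.detach()
```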

   
☝ Two reconstructions with the content loss layer based on the output of conv_4, initialized with two different noise samples. The difference is indiscernible to the human eye at a distance.

   
☝ Left: the difference between the two images instantiated as different noise samples. Right: the difference between one of the reconstructed images and the original image.


We can see that the difference between the two images is just noise. We can also see that reconstructions using content loss based on later convolutional layers still manage to retain edges.

Texture Loss

Texture loss is similar to content loss: we compare the feature values of two images. However, whereas content loss compares the values of the same feature channel of the same layer position by position, texture loss is perhaps ‘less forgiving’ because we want it optimized across entire channels at once. So we compute the Gram matrix:

$$ G = X^TX = \left(\begin{array}{c} x_1^T \\ \vdots \\ x_m^T \end{array}\right) \left(\begin{array}{ccc} x_1 & \ldots & x_m \end{array}\right)$$

where each vector \(x_i\) is a vectorized feature map in a layer, and \(X\) collects all the feature maps of that layer as its columns, so that \(G_{ij} = x_i^T x_j\).
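A sketch of the Gram matrix and the resulting texture loss for a `(B, C, H, W)` activation tensor (the `c * h * w` normalization is an assumption; normalization schemes vary):

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """Gram matrix of one layer's feature maps: G[i, j] = x_i . x_j,
    where each x_i is one channel of `feat` flattened over space."""
    b, c, h, w = feat.shape
    X = feat.reshape(b, c, h * w)                 # rows of X are the vectorized maps
    return X @ X.transpose(1, 2) / (c * h * w)    # normalization is an assumption

def texture_loss(feat_x, feat_s):
    """Squared difference between the Gram matrices of two feature maps."""
    return F.mse_loss(gram_matrix(feat_x), gram_matrix(feat_s))
```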

Texture Synthesis

Frida Kahlo's *Self-portrait with Thorn Necklace and Hummingbird* (1940)

Below are textures synthesized from Frida Kahlo’s Self-portrait with Thorn Necklace and Hummingbird (1940).

   
☝ Textures synthesized by placing texture loss layers after each of conv_1 and conv_2; after each of conv_3 and conv_4; after each of conv_5 through conv_8; and after each of conv_9 through conv_12.


We can see that the earlier we place the texture loss layers, the larger the detail is. For instance, the top row depicts blobs of colour that represent the image, while the bottom row depicts more of the characteristic thorny, pointy texture of the thorns and leaves. Texture synthesized with texture loss layers placed after conv_9 through conv_12 was closer to noise than texture, so it was omitted.

   
☝ Two textures synthesized with texture loss layers based on the outputs of all conv layers, initialized with two different noise samples. We can clearly see that these two images are different.

The above result may come as a surprise: I mentioned that the texture loss is ‘less forgiving’, yet the two images turned out quite different. Unlike image reconstruction from content loss, it is impossible to force all features at the texture loss layer to match each other, because the Gram matrix discards spatial arrangement; many different images share the same Gram matrix.
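A small demonstration of why this happens: applying one shared permutation to the spatial positions of a feature map yields a visibly different image's features but an identical Gram matrix, since each entry \(x_i^T x_j\) sums over spatial positions in any order:

```python
import torch

def gram(feat):
    b, c, h, w = feat.shape
    X = feat.reshape(b, c, h * w)
    return X @ X.transpose(1, 2)

torch.manual_seed(0)
feat = torch.randn(1, 4, 8, 8)

# Shuffle the 64 spatial positions of every channel with one shared permutation.
perm = torch.randperm(64)
shuffled = feat.reshape(1, 4, 64)[:, :, perm].reshape(1, 4, 8, 8)

assert not torch.equal(feat, shuffled)             # the feature maps differ...
assert torch.allclose(gram(feat), gram(shuffled))  # ...but the Gram matrices agree
```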

Style Transfer

Now we put both content loss and texture loss into the equation.

The hyperparameters that often gave me good results were:

```
content_layers="conv_9"
style_layers="conv_2,conv_4,conv_8,conv_16"
content_weight=1
style_weight=0.5
```
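With these hyperparameters, the combined objective can be sketched as follows (the layer names are just dictionary keys here; the feature maps are assumed to come from the same truncated VGG-19 as before):

```python
import torch
import torch.nn.functional as F

def gram(f):
    b, c, h, w = f.shape
    X = f.reshape(b, c, h * w)
    return X @ X.transpose(1, 2) / (c * h * w)

def style_transfer_loss(x_feats, content_feat, style_grams,
                        content_weight=1.0, style_weight=0.5):
    """Weighted sum of content loss at one layer and texture loss over several.

    x_feats:      dict of layer name -> feature map of the image being optimized
    content_feat: feature map of the content image at the content layer
    style_grams:  dict of layer name -> precomputed Gram matrix of the style image
    """
    c_loss = F.mse_loss(x_feats["conv_9"], content_feat)
    s_loss = sum(F.mse_loss(gram(x_feats[k]), g) for k, g in style_grams.items())
    return content_weight * c_loss + style_weight * s_loss
```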

My workflow for finding the best output is as follows:

  1. Choose two images that have similar themes, e.g. buildings, portraits, grungy, dark background & light foreground, etc.
  2. Tune the texture loss layer locations by synthesizing texture alone.
  3. Tune the style weight and number of steps (I found these to have less impact on overall image quality).

Results

On my GPU, there is not much of a difference whether we initialize the algorithm with an image prior or with randomly sampled noise.

   
☝ Each run took about 100 seconds.


Results using the dataset:



Results using some of my own photos:
