11-747 Learning-based Image Synthesis Manuel Rodriguez Ladron de Guevara
This assignment explores neural style transfer, a technique that treats the content and the style of an image as two separable entities. For example, we can transform our favorite photos into a Van Gogh painting. The algorithm takes in a content image, a style image and an input image (either a copy of the content image or white noise). The input image is optimized to match the two target images in content distance and style distance, respectively.
We follow the two seminal works by Gatys et al. (2015), Gatys_a and Gatys_b. Click here for the official PyTorch tutorial. The assignment has several parts: first optimizing in content space, then in style space, and finally combining the two to perform neural style transfer.
For this part of the assignment, we implement the content-space loss and optimize a random noise image with respect to the content loss only. The content loss is a metric that measures the content distance between two images at a given layer. Denote the $l$-th layer feature of the input image X as $X^l$ and that of the target content image C as $C^l$. The content loss is defined as the squared L2-distance of these two features: $$L_{content}(\overrightarrow{x}, \overrightarrow{c}, l) = \frac{1}{2}\sum_{i,j}(X_{ij}^l - C_{ij}^l)^2$$ We can select at which level within the VGG-19 network we extract the features that represent content. Lower layers retain the detailed pixel information of the image, so content reconstructions from them are almost perfect, whereas higher layers capture increasingly abstract content and lose texture and color fidelity.
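To make the content loss concrete, here is a minimal sketch of a per-layer content loss module in the spirit of the official PyTorch tutorial; the class name and its pass-through design are my own choices, not prescribed by the assignment:

```python
import torch
import torch.nn as nn

class ContentLoss(nn.Module):
    """Squared L2 distance between the current feature map and a fixed
    target feature map, following the formula above (sketch)."""
    def __init__(self, target: torch.Tensor):
        super().__init__()
        # Detach the target so it is treated as a constant.
        self.target = target.detach()
        self.loss = torch.tensor(0.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L_content = 1/2 * sum_{i,j} (X^l_ij - C^l_ij)^2
        self.loss = 0.5 * (x - self.target).pow(2).sum()
        return x  # pass the input through so the module can sit inside VGG-19
```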
Above are the content loss reconstructions from each convolutional layer in VGG-19. We can see that the lower convs maintain the same spatial and texture features as the original, while from the higher conv layers of the 4th block onward, texture and colors get distorted until the reconstruction becomes a ball of noise. There is a drastic leap between conv_5_1 and conv_5_2.
In this section, we look at the effect of capturing the style of an image using a style-space loss. How do we measure the distance between the styles of two images? The Gram matrix is used as the style measurement: it holds the correlations between every pair of feature channels at a given layer. Specifically, denote the $l$-th layer feature of an image, flattened to shape (N, K, H∗W), as $f^l$. Then the Gram matrix is $G = f^l(f^l)^T$, of shape (N, K, K). The idea is that the Gram matrices of the optimized image's features and the target style image's features should be as close as possible.
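In code, the Gram matrix is just a reshape followed by a batched matrix product. A minimal sketch (the function name is mine):

```python
import torch

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Gram matrix of a (N, K, H, W) feature map: flatten the spatial
    dimensions to get (N, K, H*W), then take G = f f^T of shape (N, K, K)."""
    n, k, h, w = feat.shape
    f = feat.view(n, k, h * w)
    return torch.bmm(f, f.transpose(1, 2))
```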
From the paper: to generate a texture that matches the style of a given image, we use gradient descent from a white noise image to find another image that matches the style representation of the original image. This is done by minimizing the mean-squared distance between the entries of the Gram matrices. Let $\overrightarrow{a}$ and $\overrightarrow{x}$ be the original image and the image that is generated, and $A^l$ and $G^l$ their respective style representations in layer $l$. The contribution of that layer to the total loss is: $$E_l = \frac{1}{4N^2_lM^2_l}\sum_{i,j}(G^l_{ij}-A^l_{ij})^2$$ and the total loss is: $$L_{style}(\overrightarrow{a}, \overrightarrow{x}) = \sum^L_{l=0}w_lE_l$$ where $w_l$ are weighting factors controlling the contribution of each layer to the total loss.
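Continuing the sketches above (same imports, plus the `gram_matrix` helper), a per-layer style loss module might look like this, folding the $\frac{1}{4N_l^2M_l^2}$ normalization of $E_l$ into the module:

```python
class StyleLoss(nn.Module):
    """E_l from the paper: squared distance between Gram matrices,
    scaled by 1 / (4 * N_l^2 * M_l^2) (sketch)."""
    def __init__(self, target_feat: torch.Tensor):
        super().__init__()
        self.target_gram = gram_matrix(target_feat).detach()
        _, k, h, w = target_feat.shape
        self.n_l, self.m_l = k, h * w  # N_l channels, M_l = H*W positions
        self.loss = torch.tensor(0.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = gram_matrix(x)
        self.loss = ((g - self.target_gram).pow(2).sum()
                     / (4 * self.n_l ** 2 * self.m_l ** 2))
        return x  # transparent pass-through, as with ContentLoss
```

The total style loss $\sum_l w_l E_l$ is then a weighted sum of the `.loss` fields over the chosen style layers; uniform $w_l$ is a reasonable default.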
Below, we display the battery of images corresponding to style reconstructions when extracting features at each conv layer. We see how from conv_1_1 to conv_3_1 the color is preserved, while from conv_3_2 onward the colors start drifting toward noise. Here, lower conv layers yield a finer granularity of texture, while higher conv layers produce larger strokes.
We can also group several feature layers together and see what effect this has on style. Which texture works best is somewhat subjective at this point, but to me, conv_1_1, conv_1_2 and conv_2_1 together is my favorite combination.
The following images capture the style of the images provided for this homework.
Input image: white noise
Input image: content image
From left to right, the first two images use a lower convolutional layer as the content feature and a larger set of convolutional feature maps to capture the style; these need less style weight than the runs that use a higher content feature map (the 3 rightmost images). This is because the last 3 images use 3 convolutional feature layers to capture style (conv_1_1, conv_1_2 and conv_2_1), which gives the texture a smaller granularity, so the style weight has to be increased considerably. The middle image does not capture the colors of the style image. We also see that in the last 2 images the horizontal lines of the building are more wobbly, perhaps because content is captured at a higher layer (conv_4_2). In general, the difference between using noise or the content image as input is not that great when using low content feature layers. However, when using higher content feature layers, using the content image as input preserves the shape of the content better than starting from random noise (visible in the 3rd and 4th images). Lastly, with such a high style weight, using the content image as input makes the optimization explode.
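For reference, the sketch below shows one way the two losses can be combined in the optimization loop, following the official PyTorch tutorial's pattern. It assumes the `ContentLoss` and `StyleLoss` modules from earlier have already been inserted into a pretrained VGG-19 at the chosen layers; `run_style_transfer` and its default values are illustrative, not the exact settings used here:

```python
def run_style_transfer(model, content_losses, style_losses, input_img,
                       num_steps=300, style_weight=1000.0, content_weight=1.0):
    # input_img is either white noise or a clone of the content image.
    input_img.requires_grad_(True)
    optimizer = torch.optim.LBFGS([input_img])

    step = [0]
    while step[0] < num_steps:
        def closure():
            optimizer.zero_grad()
            model(input_img)  # forward pass fills in each module's .loss
            c = content_weight * sum(cl.loss for cl in content_losses)
            s = style_weight * sum(sl.loss for sl in style_losses)
            total = c + s
            total.backward()
            step[0] += 1
            return total
        optimizer.step(closure)

    return input_img.detach().clamp_(0, 1)
```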
First, let's see the effect of changing the content layer only, keeping the rest of the parameters fixed: style weight 1000, style layers from conv_1_1 to conv_3_1.
From left to right, we use conv_2_2, conv_3_1, conv_3_2 and conv_4_1 as the content feature layer. We see that the higher the content layer, the better the original content image is preserved. We will choose the third image (content feature layer conv_3_2) to continue our ablation study.
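In code, the ablation below boils down to which layer names feed which loss. A hypothetical configuration, using the conv_b_i naming from the figures (the exact sets are illustrative):

```python
content_layer = 'conv_3_2'  # fixed for the rest of the study
style_layer_sets = [        # grown one or more layers at a time
    ['conv_1_1'],
    ['conv_1_1', 'conv_1_2', 'conv_2_1'],
    ['conv_1_1', 'conv_1_2', 'conv_2_1', 'conv_3_1'],
    ['conv_1_1', 'conv_1_2', 'conv_2_1', 'conv_3_1', 'conv_4_1'],
]
```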
Let's fix the content feature layer (conv_3_2) and modify the style feature layer, maintaining style weight at 1000.
And finally, let's see the effect of adding more and more style feature layers.
We can see that low style layers do not properly capture the texture of The Scream; we need to wait until conv_3_1 to see that effect. Style layer conv_4_1 covers the content image completely, so a lower style weight would be better for such high style layers.
In general, this gives us an intuition about the style and content weights we need depending on which convolutional feature maps we use for style and content. Summarizing: higher content convolutional layers tend to preserve the content image better than lower layers, and higher style convolutional layers tend to capture longer strokes loaded with stylistic attributes. Let's see the final images. Below, we vary only the style weight; the top row shows results with white noise as the input image and the bottom row shows results with the content image as the input image.
White noise
Content input
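As a closing sketch, the final grids vary only the style weight and the input image. With the hypothetical `run_style_transfer` from earlier (and `content_img` standing in for the loaded content photo), the sweep could be scripted as follows; the weight values are placeholders, not the exact ones used:

```python
for style_weight in (10.0, 100.0, 1000.0, 10000.0):  # placeholder values
    for input_img in (torch.rand_like(content_img),   # top row: noise input
                      content_img.clone()):           # bottom row: content input
        result = run_style_transfer(model, content_losses, style_losses,
                                    input_img, style_weight=style_weight)
```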