16-726 Learning-Based Image Synthesis

Project 4: Neural Style Transfer

Chang Shi


Overview

In this assignment, we implement neural style transfer, which renders specific content in a chosen artistic style (for example, generating cat images in the Ukiyo-e style). The algorithm takes a content image, a style image, and an input image; the input image is optimized to match the two targets in content and style distance space. In the first part of the assignment, we start from random noise and optimize it in content space, which builds familiarity with the general idea of optimizing pixels with respect to a loss. In the second part, we set content aside and optimize only to synthesize textures, which builds intuition for the connection between style-space distance and the Gram matrix. Finally, we combine all of these pieces to perform neural style transfer.
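
All three parts share the same core mechanic: the pixels of the input image are themselves the optimization variables. Below is a minimal PyTorch sketch of that loop; the function name, the L-BFGS choice, and the loop structure are assumptions based on common practice, not necessarily the exact assignment code.

    import torch

    def optimize_pixels(input_img, loss_fn, num_steps=300):
        """Treat the pixels of input_img as variables and minimize loss_fn."""
        img = input_img.clone().requires_grad_(True)
        optimizer = torch.optim.LBFGS([img])

        step = [0]
        while step[0] < num_steps:
            def closure():
                optimizer.zero_grad()
                loss = loss_fn(img)
                loss.backward()
                step[0] += 1
                return loss
            optimizer.step(closure)
            with torch.no_grad():
                img.clamp_(0, 1)  # keep pixels in a valid image range

        return img.detach()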

Part 1: Content Reconstruction

Optimizing the content loss at different layers leads to different content reconstruction effects.


Original content image

Reconstructed image by optimizing content loss at conv1, 300 steps
Reconstructed image by optimizing content loss at conv2, 300 steps
Reconstructed image by optimizing content loss at conv3, 300 steps
Reconstructed image by optimizing content loss at conv5, 300 steps

Optimizing the content loss at shallow layers yields a more vivid and detailed reconstruction, while optimizing it at deeper layers preserves only higher-level features such as lines and figure outlines. This is because features become increasingly abstract from shallow to deep layers, so the reconstruction focuses more on high-level structure. It also matches our intuition: optimizing the content loss at the very first layer amounts to reconstructing raw pixel values.
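
Concretely, the content loss at a given layer is the mean-squared error between the input's features and the target's features at that layer. Here is a minimal sketch, assuming features come from a pretrained VGG-19 as in Gatys et al.; this is the common formulation, not necessarily my exact code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ContentLoss(nn.Module):
        """Inserted after a chosen conv layer; records the MSE between the
        current features and a fixed target, then passes features through."""
        def __init__(self, target_feat):
            super().__init__()
            self.target = target_feat.detach()  # fixed target, no gradient
            self.loss = torch.tensor(0.0)

        def forward(self, x):
            self.loss = F.mse_loss(x, self.target)
            return x  # transparent layer: features flow on unchanged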



Reconstruction result from random noise 1, content loss at conv4, 300 steps
Reconstruction result from random noise 2, content loss at conv4, 300 steps

Since we use random noise as the input image, the reconstruction results differ slightly across random initializations. Compared with the original content image, however, they all preserve the main content well.

Part 2: Texture Synthesis

Optimizing the style loss at different layers likewise leads to different texture synthesis effects.


Original style image

Synthesized texture by optimizing style loss at conv1, 300 steps
Synthesized texture by optimizing style loss at conv2, 300 steps
Synthesized texture by optimizing style loss at conv3, 300 steps
Synthesized texture by optimizing style loss at conv4, 300 steps
Synthesized texture by optimizing style loss at conv5, 300 steps

Optimizing the style loss at deeper layers produces textures that look more natural and closer to the original style image. This matches our intuition: style is captured by higher-level feature statistics that only emerge after several convolutional layers.
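
The style loss compares Gram matrices, i.e. channel-wise feature correlations, which discard spatial layout and keep only texture statistics. A sketch of the usual formulation follows; the normalization by feature map size is a common convention and may differ from my exact implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def gram_matrix(feat):
        """Channel-by-channel feature correlations, normalized by size."""
        b, c, h, w = feat.size()
        f = feat.view(b * c, h * w)
        return f @ f.t() / (b * c * h * w)

    class StyleLoss(nn.Module):
        """MSE between the Gram matrices of the input and the style target."""
        def __init__(self, target_feat):
            super().__init__()
            self.target = gram_matrix(target_feat).detach()
            self.loss = torch.tensor(0.0)

        def forward(self, x):
            self.loss = F.mse_loss(gram_matrix(x), self.target)
            return x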



Synthesized texture from random noise 1, style loss at all conv layers, 300 steps
Synthesized texture from random noise 2, style loss at all conv layers, 300 steps

Since we use random noise as the input image, the synthesized textures differ in their actual pixel values across random initializations, yet they all share the style of the original image. The two textures above are also noticeably better artistically than the earlier ones, because they are generated by optimizing the style loss summed over all conv layers rather than a single one. This suggests that artistic style is usually a combination of low-level and high-level features; a sketch of this combination follows.
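
Assuming StyleLoss modules like the sketch above have been hooked after each of conv1 through conv5 and collected in a list, the combined loss is just a (possibly weighted) sum. The equal per-layer weights here are an illustrative default, not tuned values.

    def combined_style_loss(style_losses, layer_weights=None):
        """Sum per-layer style losses (e.g. from conv1-conv5) into one scalar."""
        if layer_weights is None:
            layer_weights = [1.0] * len(style_losses)  # assumed equal weighting
        return sum(w * sl.loss for w, sl in zip(layer_weights, style_losses))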

Part 3: Style Transfer

Style transfer combines content reconstruction and texture synthesis. Thus, to optimize an input image to match the two target images in content and style distance space, we need to tune hyper-parameters that balance the content and style losses. My final setting tunes the relative weight of these two terms; a sketch of the combined objective is below.
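
The total objective is a weighted sum of the content and style terms. The weights shown here are illustrative placeholders in the spirit of common implementations (style weighted far more heavily than content), not my final tuned values.

    CONTENT_WEIGHT = 1.0   # placeholder, not the tuned setting
    STYLE_WEIGHT = 1e6     # placeholder; style is typically weighted far higher

    def style_transfer_loss(content_losses, style_losses):
        """Weighted sum of all content and style loss terms in the network."""
        content = sum(cl.loss for cl in content_losses)
        style = sum(sl.loss for sl in style_losses)
        return CONTENT_WEIGHT * content + STYLE_WEIGHT * style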

Four example results, optimized from two content images paired with each of two style images in turn, are shown below.

I also compare two choices of input image, random noise and the content image itself. The style transfer results are shown below:

Running time (random noise input): 222.5716 s
Running time (content image input): 225.0122 s

We can see that using the content image as input takes slightly longer to run but produces a much more fine-grained style transfer result.

Here are some other style transfer results from my favorite images.