16-726 Learning-Based-Image-Synthesis

Yiwen Zhao's Project Page

Assignment #4 - Neural Style Transfer

Background

Artists with strong personal styles have left behind masterpieces that we still admire. Sometimes we want to transfer the style of such a work to a scene of our own while keeping the original content largely intact. The style and content sources, as well as the generated target, can be images, videos, or other media.

Motivation

Generate a sample image with the style of style_img and the content of content_img.

Content Reconstruction

VGG19 is frozen and trimmed to serve as a feature encoder, and the target is detached from the computational graph. The only thing the gradient updates is therefore input_img itself: each forward pass collects the losses, and backpropagation carries their gradient back to the input pixels. The class ContentLoss is implemented as a transparent layer: it passes its input through unmodified, but computes and stores the content loss against the embedded target of the same shape. The loss is thus measured in feature space rather than pixel space.
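
The transparent-layer idea can be sketched as follows. This is a minimal illustration rather than the exact code used for the assignment: the mean-squared-error form of the loss and the small random tensors standing in for VGG19 feature maps are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContentLoss(nn.Module):
    """Pass-through layer that records a feature-space loss to a fixed target."""

    def __init__(self, target: torch.Tensor):
        super().__init__()
        # Detach so that no gradient flows back into the target features.
        self.target = target.detach()
        self.loss = torch.tensor(0.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumption: MSE between the input features and the target features.
        self.loss = F.mse_loss(x, self.target)
        return x  # the input is returned unmodified


# Hypothetical stand-ins for feature maps (batch, channels, height, width).
target_feat = torch.randn(1, 8, 4, 4)
layer = ContentLoss(target_feat)

x = torch.randn(1, 8, 4, 4, requires_grad=True)
out = layer(x)
layer.loss.backward()  # the gradient reaches x only, never the target
```

Because the layer returns its input untouched, it can be inserted anywhere in the trimmed encoder without changing the features seen by later layers.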

I first added the ContentLoss layer after conv4. A small learning rate for optim.LBFGS is important: with the default value of 1, the updates are too large and the optimization easily falls into a poor local minimum. I use lr=0.01 together with a naive stopping criterion that halts the optimization as soon as the loss increases after a step.
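
A rough sketch of such an optimization loop is given below. The toy pixel-space loss, the tensor sizes, and the step count of 20 are placeholders chosen for illustration; only optim.LBFGS, lr=0.01, and the stop-on-increase rule come from the description above.

```python
import torch
import torch.nn.functional as F
from torch import optim

torch.manual_seed(0)

# Hypothetical stand-ins: in the real setup the loss would come from the
# ContentLoss layer inside the trimmed VGG19, not from raw pixels.
target = torch.rand(1, 3, 8, 8)
input_img = torch.rand(1, 3, 8, 8).requires_grad_(True)

optimizer = optim.LBFGS([input_img], lr=0.01)


def closure():
    # LBFGS re-evaluates the objective several times per step,
    # so the forward and backward pass live in a closure.
    optimizer.zero_grad()
    loss = F.mse_loss(input_img, target)
    loss.backward()
    return loss


prev_loss = float("inf")
losses = []
for step in range(20):
    # step() returns the loss evaluated at the start of this step.
    loss = optimizer.step(closure).item()
    if loss > prev_loss:
        break  # naive stop criterion: the loss grew, so stop optimizing
    losses.append(loss)
    prev_loss = loss
```

Each call to step() runs several inner L-BFGS iterations, so the criterion compares the loss across outer steps rather than across individual closure evaluations.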

img phipps

content source

img contentL_lr001_conv1_1 img contentL_lr001_conv2_1 img contentL_lr001_conv3_1 img contentL_lr001_conv4_1 img contentL_lr001_conv5_1

Add content loss after a single conv layer, conv1 ~ conv5, images resized to square.

Computing the content loss on features from different layers leads to different results. The coarser the feature map, the blurrier the reconstructed image; conversely, finer feature maps lead to sharper reconstructions. Among all the outcomes, conv2 preserves color best.

Texture Synthesis

The implementation of StyleLoss closely follows that of ContentLoss. What makes it focus on texture is the Gram matrix, which records correlations between feature channels and discards their spatial arrangement. Adding StyleLoss after only a single layer does not work well. As in content reconstruction, coarse layers focus on coarse-grained features and fine layers focus on fine-grained features.
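
To make the role of the Gram matrix concrete, here is a small sketch. The normalization by c * h * w and the MSE between Gram matrices are assumptions, and the random tensor is a hypothetical stand-in for a VGG19 feature map. The last lines shuffle the spatial positions of the features to show that the Gram matrix does not change, which is why the loss responds to texture and not to layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Channel-by-channel correlations of a (batch, C, H, W) feature map."""
    b, c, h, w = feat.shape
    flat = feat.reshape(b, c, h * w)
    gram = torch.bmm(flat, flat.transpose(1, 2))  # shape (batch, C, C)
    return gram / (c * h * w)  # assumed normalization


class StyleLoss(nn.Module):
    """Pass-through layer that compares Gram matrices instead of raw features."""

    def __init__(self, target_feat: torch.Tensor):
        super().__init__()
        self.target = gram_matrix(target_feat).detach()
        self.loss = torch.tensor(0.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        self.loss = F.mse_loss(gram_matrix(x), self.target)
        return x  # the input is returned unmodified


torch.manual_seed(0)
feat = torch.randn(1, 4, 5, 5)

# Shuffle the 25 spatial positions identically for every channel.
perm = torch.randperm(25)
shuffled = feat.reshape(1, 4, 25)[:, :, perm].reshape(1, 4, 5, 5)

layer = StyleLoss(feat)
layer(shuffled)  # loss stays near zero although the layout is scrambled
```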

img escher_sphere

style source

img styleL_lr001_conv1_1 img styleL_lr001_conv2_1 img styleL_lr001_conv3_1 img styleL_lr001_conv4_1 img styleL_lr001_conv5_1

Add style loss after a single conv layer, conv1 ~ conv5, images resized to square.

img styleL_lr001_conv12345_2 img styleL_lr001_conv12345_3

Add style loss after conv1 ~ conv5, images resized to square, using different random seeds to initialize.

I prefer the result that uses conv1 ~ conv5 together. Since input_img is initialized with white noise, two runs with the same fixed random seed produce the same result. Using different seeds makes the samples diverse.
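
The effect of the seed on the starting point can be illustrated with a short sketch. The helper name init_noise, the uniform distribution on [0, 1), and the image size are assumptions made for this example.

```python
import torch


def init_noise(shape, seed: int) -> torch.Tensor:
    """Hypothetical helper: seeded white-noise image that requires gradients."""
    gen = torch.Generator().manual_seed(seed)
    # Assumption: uniform noise in [0, 1); Gaussian noise would work similarly.
    return torch.rand(shape, generator=gen).requires_grad_(True)


shape = (1, 3, 16, 16)
a = init_noise(shape, seed=2)
b = init_noise(shape, seed=2)  # same seed: identical starting image
c = init_noise(shape, seed=3)  # different seed: a different starting image
```

With an identical starting image and a deterministic optimizer, the synthesized texture is reproducible; changing the seed changes the starting point and therefore the sample.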

Style Transfer

I choose the hyperparameters lr=0.01, style_weight=20, and content_weight=1. The results for the given images are shown in the grid below.
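
One plausible way to combine the two kinds of loss with these weights is sketched here. Summing the per-layer losses before weighting is an assumption, and total_loss is a hypothetical helper name; the scalar tensors stand in for the values stored by the loss layers.

```python
import torch

# Weights taken from the text above.
style_weight = 20.0
content_weight = 1.0


def total_loss(style_losses, content_losses) -> torch.Tensor:
    """Hypothetical helper: weighted sum of per-layer style and content losses."""
    style_score = sum(style_losses)      # assumed: plain sum over layers
    content_score = sum(content_losses)
    return style_weight * style_score + content_weight * content_score


# Made-up loss values, one per layer that carries a loss.
style_losses = [torch.tensor(0.5), torch.tensor(0.25)]
content_losses = [torch.tensor(2.0)]
loss = total_loss(style_losses, content_losses)
```

A larger style_weight relative to content_weight pushes the optimization toward matching texture statistics at the expense of the content features.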

img styleL_lr001_conv1_1

The style sources are picasso and the scream; the content sources are dancing and wally.

With the same random seed, initializing from random white noise and initializing from the content image give different results. I think content initialization better preserves the colors of the original image.

img st_contentinit img st_randominit

Left: content init -- Right: random init.

Here's the result from my own images.

img style3 img reconstructed_image1 img content3

Style -- Reconstructed Result -- Content.