16-726 Learning-Based Image Synthesis

Assignment 4: Neural Style Transfer
Jun Luo
Apr. 2021

Overview

In this assignment, we implement neural style transfer. The algorithm takes a content image, a style image, and an input image; the input image is optimized to match the two target images. In the first two parts, we focus on content reconstruction and texture synthesis separately. In the last part, we combine them for neural style transfer.


Content Reconstruction

We use a content loss to reconstruct the content of the content image. Specifically, the content loss measures the distance between the feature maps of the content image and the input image at one or more layers of the same neural network.
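As a framework-agnostic sketch of this loss (the report extracts features from VGG-19; here `feat_input` and `feat_content` are placeholders for feature maps already extracted at a chosen layer, and the mean-squared-error form is an assumption about the distance metric):

```python
import numpy as np

def content_loss(feat_input, feat_content):
    """Mean squared error between two feature maps of shape (C, H, W)."""
    return np.mean((feat_input - feat_content) ** 2)
```

When the loss is taken over several layers, the per-layer values are typically summed.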

Below are the results of optimizing the content loss over different layers. These layers are chosen because each is the last layer of a convolutional block of VGG-19.

Original image
conv_2
conv_4
conv_7
conv_10
conv_13

We can see from the results above that the content loss over shallow layers yields better reconstructions of the content. Below are two reconstructions using the content loss over the second convolutional layer, starting from two different noise images, together with their difference. For most pixels the two reconstructions agree, but many pixels still differ in intensity in the R, G, or B channel.

reconstruction from noise1
reconstruction from noise2
difference

Texture Synthesis

We measure the distance between the styles of two images using the Gram matrix. For a feature map reshaped to one row per channel, the Gram matrix is the matrix of inner products between channel activations, i.e., their pairwise correlations. Below we investigate optimizing the texture loss over different layers, with synthesized texture images that simulate the style of Frida Kahlo. We can see from the results below that using shallow layers produces textures closer to the style of the original style image.

Original image
conv1 ~ 5
conv6 ~ 10
conv10 ~ 13
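The Gram-matrix style loss described above can be sketched as follows (a minimal numpy version; the normalization constant is an assumption, as conventions vary between implementations):

```python
import numpy as np

def gram_matrix(feat):
    """Gram matrix of a feature map of shape (C, H, W):
    pairwise correlations between the C channel activations."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)      # one row per channel
    return (f @ f.T) / (c * h * w)  # (C, C), normalized (convention varies)

def style_loss(feat_input, feat_style):
    """Mean squared error between the two Gram matrices."""
    return np.mean((gram_matrix(feat_input) - gram_matrix(feat_style)) ** 2)
```

Because the Gram matrix discards spatial arrangement, matching it constrains texture statistics without constraining content.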

Below are the texture syntheses from two different noise images and their difference. We can see that the difference between the two texture syntheses is much larger than for the content reconstructions from noise. This is expected: the two outputs are only constrained to share the same texture style, not the same content.

texture synthesis from noise1
texture synthesis from noise2
difference

Style Transfer

We tune the hyperparameters and use the following settings for style transfer, where the loss consists of both the content loss and the style loss: the content loss over the second convolutional layer, and the style loss over the first five convolutional layers. We set the style loss weight to \( 1 \times 10^4 \) and keep the content loss weight at 1. Below are the results.
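With the weights above, the combined objective can be sketched as follows (summing the per-layer losses is an assumption about how multiple layers are aggregated):

```python
CONTENT_WEIGHT = 1.0   # content loss weight from the report
STYLE_WEIGHT = 1e4     # style loss weight from the report

def total_loss(content_losses, style_losses):
    """Weighted sum of per-layer content and style losses."""
    return (CONTENT_WEIGHT * sum(content_losses)
            + STYLE_WEIGHT * sum(style_losses))
```

The large style weight compensates for the small magnitude of Gram-matrix differences relative to raw feature differences.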

We also explore the difference between using a noise image as the input and using a clone of the content image as the input. Below are the results with the falling water as the content and the Frida Kahlo painting as the style, along with their differences. Optimization with noise as input takes 45.7 seconds, while optimization from a clone of the content image takes 47.3 seconds. As for the produced output images, apart from local brightness differences, the two outputs are similar to each other.

transfer from noise
transfer from content image
difference
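The two initializations compared above can be sketched as follows (`init_input` and the uniform-noise range are hypothetical choices for illustration):

```python
import numpy as np

def init_input(content_img, mode="noise", rng=None):
    """Initialize the optimized image either from random noise
    or from a clone of the content image."""
    if mode == "noise":
        rng = rng or np.random.default_rng()
        return rng.random(content_img.shape)  # uniform noise in [0, 1)
    return content_img.copy()                 # clone of the content image
```

Starting from the content clone already satisfies the content loss at iteration zero, which explains why both runs converge to similar outputs in comparable time.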

Below are some results of stylizing content images taken by me with style images from the internet.

content image (my photo)
style image (source)
content + style
content image (my photo)
style image (source)
content + style

Bells & Whistles

We stylize the cat images that we used in homework 3.

content image
style image
stylized grumpy cat
content image
style image
stylized russian blue