Neural Style Transfer

Carnegie Mellon University, 16-726 Learning-Based Image Synthesis, Spring 2021

Null Reaper (Clive Gomes)

Task

The goal of this project was to implement neural style transfer; specifically, we attempted to recreate the content of one image in the artistic style of another. We did this by taking two input images (one for content and the other for style) and optimizing the corresponding content and style losses with respect to the network's output image.

Content Reconstruction

For the initial setup, we implemented the content-space loss alone and studied the output of the Neural Net. The following image was used for the content reconstruction experiments.

Input Image
Figure 1: Input Image for Content Reconstruction

Defining Content loss

Content loss measures the difference in content between two images; it is typically computed at a particular layer of the network. We define this loss as the mean-squared error (MSE) between the feature maps of the two images at that layer. In our implementation, we used the MSELoss() function from PyTorch's Neural Network (nn) module.
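
A content-loss layer of this kind can be written as a small PyTorch module that stores the target features and records the MSE on every forward pass. The sketch below is illustrative rather than our exact code:

    import torch
    import torch.nn as nn

    class ContentLoss(nn.Module):
        """Stores target features and records the MSE against them on each forward pass."""
        def __init__(self, target):
            super().__init__()
            # Detach the target so it is treated as a fixed constant during optimization.
            self.target = target.detach()
            self.criterion = nn.MSELoss()
            self.loss = torch.tensor(0.0)

        def forward(self, x):
            # Record the content loss for this layer, then pass the features through unchanged.
            self.loss = self.criterion(x, self.target)
            return x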

Building the Neural Network

To build our model, we used the pre-trained VGG-19 net model which is available within TorchVision's built-in models.

We created our network's architecture by starting with a Normalization layer (converting pixel values into z-scores) and then added each layer from the VGG-19 net one at a time. After every convolution layer, we added a content-loss layer which simply computes the content-loss (as described earlier) during a forward pass of the network. Finally, we trimmed off all layers (ReLU, Normalization, etc.) after the last content-loss layer.
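
A rough sketch of this construction, assuming the ContentLoss module above (the layer-naming scheme, where conv_i denotes the i-th convolution layer, and the trimming logic are illustrative, not necessarily our exact code):

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class Normalization(nn.Module):
        """Converts pixel values into z-scores using the VGG training statistics."""
        def __init__(self, mean, std):
            super().__init__()
            self.mean = mean.view(-1, 1, 1)
            self.std = std.view(-1, 1, 1)

        def forward(self, img):
            return (img - self.mean) / self.std

    def build_content_model(content_img, content_layers=("conv_2",)):
        """Builds Normalization + VGG-19 layers with ContentLoss layers inserted."""
        cnn = models.vgg19(pretrained=True).features.eval()
        mean = torch.tensor([0.485, 0.456, 0.406])
        std = torch.tensor([0.229, 0.224, 0.225])

        model = nn.Sequential(Normalization(mean, std))
        content_losses, i = [], 0

        for layer in cnn.children():
            if isinstance(layer, nn.Conv2d):
                i += 1
                name = f"conv_{i}"
            elif isinstance(layer, nn.ReLU):
                name = f"relu_{i}"
                layer = nn.ReLU(inplace=False)  # in-place ReLU would corrupt the stored features
            elif isinstance(layer, nn.MaxPool2d):
                name = f"pool_{i}"
            else:
                name = f"{layer.__class__.__name__.lower()}_{i}"
            model.add_module(name, layer)

            if name in content_layers:
                # The target features come from a forward pass of the content image.
                target = model(content_img).detach()
                loss_layer = ContentLoss(target)
                model.add_module(f"content_loss_{i}", loss_layer)
                content_losses.append(loss_layer)

        # Trim off everything after the last content-loss layer.
        last = max(idx for idx, m in enumerate(model) if isinstance(m, ContentLoss))
        return model[: last + 1], content_losses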

Optimization

To optimize the input image against the pre-trained VGG-19 net (with the added Normalization and Style/Content Loss layers), we used the LBFGS optimizer. This optimization step is repeated many times during a training run; each iteration involves the following steps:

  1. Clamping the input image pixel values between 0 and 1
  2. Clearing the gradients by setting them to 0
  3. Passing the input image through the model
  4. Computing the weighted content loss and its gradient
  5. Returning the loss

In addition to the steps mentioned above, we also print the (unweighted) content loss every 10 iterations and clamp the input image between 0 and 1 one final time at the end of the optimization. The output image is also saved as a PNG file every 100 iterations. A sketch of this loop is given below.
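
The following sketch shows this loop for the content-only case (illustrative; the full version also handles style losses and the periodic PNG saving):

    import torch

    def reconstruct_content(model, content_losses, input_img, num_steps=300, content_weight=1.0):
        """Optimizes the pixels of input_img so that its features match the content target."""
        model.requires_grad_(False)            # freeze the network; only the image is optimized
        optimizer = torch.optim.LBFGS([input_img.requires_grad_()])
        step = [0]

        while step[0] < num_steps:
            def closure():
                # 1. Clamp pixel values to the valid [0, 1] range.
                with torch.no_grad():
                    input_img.clamp_(0, 1)
                # 2. Clear the gradients from the previous iteration.
                optimizer.zero_grad()
                # 3. Pass the input image through the model (fills in the loss layers).
                model(input_img)
                # 4. Compute the weighted content loss and its gradient.
                loss = content_weight * sum(cl.loss for cl in content_losses)
                loss.backward()
                step[0] += 1
                if step[0] % 10 == 0:
                    print(f"step {step[0]}: content loss = {sum(cl.loss.item() for cl in content_losses):.6f}")
                # 5. Return the loss so LBFGS can use it.
                return loss

            optimizer.step(closure)

        # Clamp one final time so the output is a valid image.
        with torch.no_grad():
            input_img.clamp_(0, 1)
        return input_img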

Experiments

As discussed earlier, content loss is applied at a specific layer of the network. Accordingly, we tried applying the content-loss function at each of the first five convolution layers of the VGG-19 model (conv_1 through conv_5). The results are shown below.

Layer 1 Reconstruction Layer 2 Reconstruction Layer 3 Reconstruction Layer 4 Reconstruction Layer 5 Reconstruction
Figure 2: Content Reconstruction w/ Content Loss at Layer 1, 2, 3, 4 & 5 (from left to right)

As seen above, the reconstructed image gets noisier when the content loss is applied at later layers of the network; this is because the feature maps at those layers are more abstracted from the original input image. The same trend appears in the minimum content loss obtained for reconstruction at each layer, which grows with depth: 0.0, 0.000177, and 0.109410 for the first three layers (values are for a 300-step optimization). Accordingly, we chose to apply the content loss at the second layer (conv_2), since that is the deepest we can go without noticeable noise.

Here are a few more examples of content reconstruction for different content images, using two random-noise input images each trained for 300 steps:

Original Image, Reconstructed Image #1, Reconstructed Image #2 (one row per content image)
Figure: Content reconstruction for three content images, each reconstructed from two different random-noise inputs (original on the left, reconstructions to its right)

As seen above, the reconstructed images are slightly blurry, though quite similar to the original images. This may be because content reconstruction was performed at layer 2, which is slightly abstracted away from the original image. Additionally, even though the two reconstructed images for each example started off as randomly-generated noise, they look almost identical; this is because the minimum content loss obtained for either of these images was extremely small (around 0.0001). This shows that our content reconstruction operation is working as expected.

Texture Synthesis

This time around, we focused on style-space loss; the steps were similar to those in the previous section. Below is the image used for the texture synthesis experiments; the image was (manually) reshaped to have the same dimensions as the content image (since they need to match so that pixel-by-pixel loss can be computed).

Input Image
Figure 3: Input Image for Texture Synthesis

Defining Style loss

To compute the difference in style between two images, we take the Gram matrix of the feature maps at a given layer and then apply the mean-squared error (MSE) function to the Gram matrices, as in the content-loss calculation.
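
A minimal sketch of the Gram matrix and the corresponding style-loss layer (names are illustrative; normalizing by the number of feature-map elements is one common convention):

    import torch
    import torch.nn as nn

    def gram_matrix(x):
        """Gram matrix of a feature map: inner products between all pairs of channels."""
        b, c, h, w = x.size()
        features = x.view(b * c, h * w)
        gram = features @ features.t()
        # Normalize so that deeper layers (with more elements) do not dominate the loss.
        return gram / (b * c * h * w)

    class StyleLoss(nn.Module):
        """Stores the target Gram matrix and records the MSE against it on each forward pass."""
        def __init__(self, target_features):
            super().__init__()
            self.target = gram_matrix(target_features).detach()
            self.criterion = nn.MSELoss()
            self.loss = torch.tensor(0.0)

        def forward(self, x):
            self.loss = self.criterion(gram_matrix(x), self.target)
            return x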

Adding Style-Loss Layers

Instead of adding content-loss layers to the pre-trained VGG-19 model, we now added style-loss layers; everything else is exactly as in the previous section.

Optimization

Finally, we added the optimization step for style-space loss. Rather than creating a separate optimization function, we simply added two flags, use_content and use_style, to select which losses are included in the objective.
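
Building on the loss layers sketched earlier, the flag logic might look roughly like this (a minimal illustration, not our exact code):

    def total_loss(content_losses, style_losses, use_content=True, use_style=True,
                   content_weight=1.0, style_weight=1e6):
        """Sums whichever losses are enabled; the two flags let one optimization routine
        cover content reconstruction, texture synthesis, and full style transfer."""
        loss = 0.0
        if use_content:
            loss = loss + content_weight * sum(cl.loss for cl in content_losses)
        if use_style:
            loss = loss + style_weight * sum(sl.loss for sl in style_losses)
        return loss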

Experiments

Just like with content loss, we tried applying the style-loss function at each layer of the VGG-19 net model. Results are shown below.

Layer 1 Texture Synthesis Layer 2 Texture Synthesis Layer 3 Texture Synthesis Layer 4 Texture Synthesis Layer 5 Texture Synthesis
Figure 4: Texture Synthesis w/ Style Loss at Layer 1, 2, 3, 4 & 5 (from left to right)

As seen above, applying style-loss optimization at different layers of the VGG-19 model results in different textures. Among these, the layer-2 output is the smoothest (the layer-1 output is a close second); the other outputs show varying degrees of noise. Accordingly, we decided to apply style loss at layers 1 & 2. The result of this combined loss optimization is as follows:

Layer 1 & 2 Texture Synthesis
Figure 5: Texture Synthesis w/ Style Loss at Layers 1 & 2

Below are a few more examples of texture synthesis for different style images, using two random-noise input images each trained for 300 steps (style loss at layers 1 & 2, as before):

Original Image, Texture #1, Texture #2 (one row per style image)
Figure: Texture synthesis for three style images, each synthesized from two different random-noise inputs (original on the left, textures to its right)

The textures generated resemble the colors in the original input images. Since random noise was used as the input, the two texture images generated for each example are different, though the patterns are quite similar. Accordingly, the texture synthesis step is also working as expected.

Style Transfer

After having tested both content-space loss and style-space loss independently, we now put them both together to perform style transfer. The results are included in this section.

Hyperparameters Used

The following values were used for generating the results in this section (they are restated in code form below):

  • Content-Loss Layers: Layer 2
  • Style-Loss Layers: Layers 1 & 2
  • Content Weight: 1
  • Style Weight: 1000000
  • Input Image: White Noise
  • Number of Optimization Steps: 1500 (sharper outputs may be obtained by training for longer)
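
Written out in code form, these settings correspond to something like the following (variable names are illustrative):

    content_layers = ("conv_2",)           # content loss at layer 2
    style_layers   = ("conv_1", "conv_2")  # style loss at layers 1 & 2
    content_weight = 1
    style_weight   = 1_000_000             # style weighted heavily relative to content
    num_steps      = 1500                  # longer runs tend to give sharper outputs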

Content vs Style Grid

Content Inputs #1–2 (rows) × Style Inputs #1–2 (columns)
Figure: Style transfer grid; each cell shows the output for Content #i combined with Style #j

Noise vs Content Image as Input

We also compared using the content image versus white noise as the input image (up to this point in the assignment, only white noise had been used). Below are the content and style images used.

Content Image Style Image
Figure 6: Content Image (left) and Style Image (right)

For each choice of input image, we ran style transfer for 100 optimization steps. Below are the outputs.

Noise Output Content Image Output
Figure 7: Output using Noise as Input (left) vs Output using Content Image as Input (right)

The runtimes for style transfer using noise vs. the content image as input were 93.78 and 102.66 seconds, respectively (these runtimes are large even for 100 steps because a CPU was used). Using the content image as the input makes the optimization take slightly longer (though the two are still comparable), but the output is much better than when using noise as the input. The left image looks very noisy compared to the right because, at 100 steps, the network has not yet recovered the content of the image; using the content image as the input skips this stage, so the output looks smooth very early on in training.
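
Concretely, the only difference between the two runs is how the optimized image is initialized; a minimal sketch, assuming a content_img tensor has already been loaded with values in [0, 1]:

    import torch

    content_img = torch.rand(1, 3, 512, 512)          # placeholder; in practice the loaded content image

    noise_input   = torch.rand(content_img.size())    # white-noise start: content must be rediscovered
    content_input = content_img.clone()               # content-image start: content is already present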

More Results

(Note: White noise was used as input in these images.)

Content Image, Style Image, Output Image (one row per example)
Figure: Additional style transfer results for three content/style pairs