Content image (left): Fallingwater, a place of interest near Pittsburgh. Style image (middle): the painting Self-Portrait with Thorn Necklace and Hummingbird by Frida Kahlo. Output (right): Frida-Kahlo-ized Fallingwater.

Introduction

In this assignment, we implement neural style transfer, which takes a content image and a style image and transforms an input image into a new image by optimizing a loss function composed of weighted content and style terms computed against the corresponding input images.

In the first part of the assignment, we start from random noise and optimize it in content space. Later, we optimize only the style loss to generate new textures, which builds some intuition about the connection between style-space distance and the Gram matrix. Lastly, we combine all of these pieces to perform neural style transfer.

Part 1: Content Reconstruction

For the first part of the assignment, we have a content-space loss, and we optimize a random noise image with respect to this content loss only.

Content Loss: The content loss is a metric function that measures the content distance between two images at a certain individual layer. Denote the $L$-th-layer feature of the input image $X$ as $f^L_X$ and that of the target content image $C$ as $f^L_C$. The content loss is defined as the squared L2 distance between these two features, $\|f^L_X - f^L_C\|_2^2$.

The code for content loss looks as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentLoss(nn.Module):

  def __init__(self, target):
      super(ContentLoss, self).__init__()
      # detach the target content from the graph that produced it,
      # so that we do not track gradients through it
      self.target = target.detach()

  def forward(self, input):
      # a passthrough layer: compute and save the loss, then return the input unchanged
      self.loss = F.mse_loss(input, self.target)
      return input

Feature extractor and Loss Insertion: Of course, when L equals 0 the content loss is just an L2 pixel loss, which does not capture content. The content loss is therefore computed in feature space. To extract the features, a VGG-19 network pre-trained on ImageNet is used. The pre-trained VGG-19 consists of 5 convolutional blocks (conv1-conv5), and each block serves as a feature extractor at a different level of abstraction.

The pre-trained VGG-19 can be directly imported from the torchvision.models module, and the individual modules are loaded from the "features" part of the model, which corresponds to the convolutional and pooling layers (i.e. everything before the fully connected classifier). We can get the output after a given convolutional block by taking the output of the network just before each maxpool step. Alternatively, we can take the output after every single convolutional layer, as in the PyTorch Neural Style Transfer tutorial. These two techniques give different results, since the points in the network where the features are extracted are quite different.
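Below is a minimal sketch of how this loss insertion might look. The helper name build_content_model and its details are illustrative, not the exact assignment code; it assumes the ContentLoss module above and a preprocessed content_img tensor, and it implements the per-conv-layer variant (stopping just before each MaxPool2d instead would give the per-block variant).

import torch.nn as nn
import torchvision.models as models

# Load the pre-trained VGG-19 and keep only its "features" sub-network,
# which holds the convolutional and pooling layers.
cnn = models.vgg19(pretrained=True).features.eval()

def build_content_model(cnn, content_img, content_layer=2):
    # Rebuild the VGG features as an nn.Sequential with a ContentLoss
    # inserted right after the chosen conv layer (hypothetical helper).
    model = nn.Sequential()
    content_losses = []
    conv_idx = 0
    for layer in cnn.children():
        if isinstance(layer, nn.ReLU):
            # out-of-place ReLU so activations saved by the loss module
            # are not overwritten in place
            layer = nn.ReLU(inplace=False)
        model.add_module(str(len(model)), layer)
        if isinstance(layer, nn.Conv2d):
            conv_idx += 1
            if conv_idx == content_layer:
                # the content image's feature at this layer is the fixed target
                target = model(content_img).detach()
                content_loss = ContentLoss(target)
                model.add_module("content_loss", content_loss)
                content_losses.append(content_loss)
                break  # layers past the loss are not needed in Part 1
    return model, content_losses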

Optimization: Here the neural network is kept fixed, and the pixel values of the image are optimized instead. We use the quasi-Newton optimizer L-BFGS to optimize the image: optimizer = optim.LBFGS([input_img.requires_grad_()]). The optimizer re-evaluates the objective multiple times per step, so rather than a simple loss.backward(), we need to pass a closure that 1) clears the gradients, 2) computes the loss and its gradients, and 3) returns the loss.
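As an illustration, here is roughly what that closure-based loop looks like for content-only reconstruction. This is a sketch, assuming model and content_losses come from a helper like the one above and content_img is a preprocessed tensor; it is not the exact assignment code.

import torch
import torch.optim as optim

input_img = torch.randn_like(content_img)          # start from random noise
optimizer = optim.LBFGS([input_img.requires_grad_()])

num_steps = 300
step = [0]
while step[0] < num_steps:

    def closure():
        # keep pixel values in a valid range between re-evaluations
        with torch.no_grad():
            input_img.clamp_(0, 1)
        optimizer.zero_grad()                        # 1) clear the gradients
        model(input_img)                             # forward pass fills in each .loss
        loss = sum(cl.loss for cl in content_losses)
        loss.backward()                              # 2) compute the gradients
        step[0] += 1
        return loss                                  # 3) return the loss to LBFGS

    optimizer.step(closure)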

Experiment:

  1. Optimizing the content loss at deeper layers gives diminishing reconstruction quality. In the first couple of layers/blocks, the optimization with N=300 steps (the default) is able to reconstruct the original content image almost exactly. As we move deeper into the network (whether counting by layers or by blocks), the content becomes harder to reconstruct, probably because the deeper layers encode more abstract representations of the original image, so reconstructing the content requires more optimization.
  2. I would say the best results come from one of the early blocks (conv1 or conv2). Starting from random input noise, here are some sample results after optimizing the content loss placed after conv block 1 (conv layer 2).

It can be seen that the reconstructions above are almost identical to the original content image in both cases. However, if we use deeper conv blocks/layers, the results are far less faithful and often quite noisy. The reason seems to be the difficulty of optimizing over the large number of variables involved in matching the deeper, more abstract layer activations of the network.

Part 2: Texture Synthesis

Style loss is measured using the Gram matrix, which captures the correlations between feature channels. Specifically, denote the $L$-th-layer feature of an image, reshaped to shape $(N, K, HW)$, as $f^L$, where its $k$-th row $f^L_k$ is the $k$-th feature channel flattened. The Gram matrix is then $G = f^L (f^L)^T$, with shape $(N, K, K)$. The idea is that the Gram matrices of the optimized image's features and the style image's features should be as close as possible.

This style loss based on the gram matrix can be implemented as follows:

def gram_matrix(activations):
  a, b, c, d = activations.size()  # a = batch size (=1)
  # b = number of feature maps
  # (c, d) = spatial dimensions of a feature map
  features = activations.view(a*b, c*d)

  gram = torch.mm(features, features.t())

  # 'normalize' the values of the gram matrix
  # by dividing by the number of elements in each feature map
  return gram.div(a*b*c*d)

class StyleLoss(nn.Module):

  def __init__(self, target_feature):
      super(StyleLoss, self).__init__()
      # cache the target Gram matrix, detached so no gradients flow into it
      self.target = gram_matrix(target_feature).detach()

  def forward(self, input):
      # a passthrough layer, like ContentLoss: save the loss and return the input
      G = gram_matrix(input)
      self.loss = F.mse_loss(G, self.target)
      return input

Applying loss: This loss is inserted into the network in the same way as the content loss, after certain conv layers/blocks.
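A sketch of that insertion is below. It mirrors the hypothetical build_content_model helper from Part 1, except that StyleLoss modules are dropped in after several conv layers (here 1-5); cnn is again the VGG-19 "features" module and style_img is a preprocessed style image tensor.

import torch.nn as nn

def build_style_model(cnn, style_img, style_layers=(1, 2, 3, 4, 5)):
    model = nn.Sequential()
    style_losses = []
    conv_idx = 0
    for layer in cnn.children():
        if isinstance(layer, nn.ReLU):
            layer = nn.ReLU(inplace=False)  # avoid in-place overwrites of saved activations
        model.add_module(str(len(model)), layer)
        if isinstance(layer, nn.Conv2d):
            conv_idx += 1
            if conv_idx in style_layers:
                # the Gram matrix of the style image's feature at this layer
                # becomes the fixed target inside StyleLoss
                target = model(style_img).detach()
                style_loss = StyleLoss(target)
                model.add_module("style_loss_{}".format(conv_idx), style_loss)
                style_losses.append(style_loss)
            if conv_idx >= max(style_layers):
                break  # layers past the last style loss are not needed
    return model, style_losses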

Experiment:

  1. When the style loss is optimized only at low-level layers, the synthesized texture is of poor quality. Only when we go deeper into the network do we see more artistic representations of the texture of the provided style image. The images below show the textures synthesized at conv layers 1, 2 and 4.
  2. With two random noise inputs and the same set of style loss layers (1-5), here are some results:
  3. Although at a high level the textures being synthesized are similar, the pattern is different if you look at it closely. This is due to the complexity of the loss surface and the nature of non-convex optimization. The starting point of the random noise image influences the final result, especially considering that we are using only N=300 steps to run this optimization.

Part 3: Style Transfer

Applying Losses: Both style and content losses are applied at the appropriate convolutional blocks/layers.
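Concretely, the only change from the Part 1 optimization loop is that the closure now sums both sets of losses with tunable weights. The sketch below assumes a combined model containing both ContentLoss and StyleLoss modules (with the corresponding content_losses and style_losses lists), plus the input_img and optimizer set up as before; the weight values are illustrative, not the assignment defaults.

style_weight = 1e6      # illustrative value, not the assignment default
content_weight = 1.0

def closure():
    with torch.no_grad():
        input_img.clamp_(0, 1)
    optimizer.zero_grad()
    model(input_img)  # forward pass fills in .loss on every inserted loss module
    style_score = sum(sl.loss for sl in style_losses)
    content_score = sum(cl.loss for cl in content_losses)
    loss = style_weight * style_score + content_weight * content_score
    loss.backward()
    return loss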

Experiment:

  1. I used the conv2 layer for the content loss and conv layers 1-5 for the style loss. I also crop the images to the same size after resizing (see the preprocessing sketch after this list); it isn't optimal, but it works for now. Finally, I also tried tuning the style weight from the originally provided value down to one-half and one-tenth of it, to get somewhat different styles as shown in the image below.
  2. Here is a 3x3 grid with the content forming the columns and the styles forming the rows.
  3. Here is a comparison of style transfer when it is run on random noise as an input and the content as the input.
  4. It can be observed that using the content image for initialization allows the model to retain some of the edge information while applying the new textures, whereas starting from random noise leads the optimization to focus more on the style than on the content (note the borders in the image). There is an overall abstract, artistic theme in the noise-initialized result, but the one that is more true to the technique is the one obtained from content initialization.

  5. The image below shows some nice results of style transfer on an image I had taken near my apartment in Fall 2019!
  6. Here's another result of style transfer on an image from the Oakland Bridge in San Francisco! Definitely not as nice as the one above, but still quite interesting to see the patterns.
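For reference, this is roughly the resize-then-crop preprocessing mentioned in item 1 above, so that the content and style images end up the same size. It is a sketch: the image size and file names are placeholders, not the assignment defaults.

import torchvision.transforms as transforms
from PIL import Image

imsize = 512  # placeholder; pick a size that fits in memory
loader = transforms.Compose([
    transforms.Resize(imsize),       # scale the shorter side to imsize
    transforms.CenterCrop(imsize),   # then crop so every image ends up the same square size
    transforms.ToTensor(),
])

def image_loader(path):
    image = Image.open(path).convert("RGB")
    # add a batch dimension, since the VGG feature extractor expects (N, C, H, W)
    return loader(image).unsqueeze(0)

content_img = image_loader("content.jpg")  # placeholder file names
style_img = image_loader("style.jpg")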

Bells & Whistles

  • Stylized Poisson blended images from the previous homework are shown below:

Further Resources

Acknowledgement: This assignment is based on the PyTorch Neural Style Transfer tutorial.