Carnegie Mellon University, 16-726 Learning-Based Image Synthesis, Spring 2021
Task
The goal of this project was to implement neural style transfer: recreating the content of one image in the artistic style of another. We did this by taking two input images (one for content, one for style) and optimizing an output image to minimize the corresponding content and style losses computed from the network's features.
Content Reconstruction
For the initial setup, we implemented the content-space loss alone and studied the output of the Neural Net. The following image was used for the content reconstruction experiments.
Defining Content loss
Content loss measures the difference in content between two images; it is typically computed at a particular layer of the network. We define it as the mean-squared error (MSE) between the feature maps of the two images at that layer. In our implementation, we used the MSELoss() function included in PyTorch's neural network (nn) module.
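As a concrete reference, here is a minimal sketch of such a content-loss layer, following the standard PyTorch style-transfer recipe (class and attribute names are illustrative, not necessarily identical to our code):

```python
import torch
import torch.nn as nn

class ContentLoss(nn.Module):
    """Transparent layer: records the MSE between the current feature map and a
    fixed target feature map, then passes its input through unchanged."""
    def __init__(self, target):
        super().__init__()
        self.target = target.detach()   # treat the target features as a constant
        self.criterion = nn.MSELoss()
        self.loss = torch.tensor(0.0)

    def forward(self, x):
        self.loss = self.criterion(x, self.target)
        return x
```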
Building the Neural Network
To build our model, we used the pre-trained VGG-19 net model which is available within TorchVision's built-in models.
We created our network's architecture by starting with a normalization layer (converting pixel values into z-scores) and then adding each layer from the VGG-19 net one at a time. After each convolution layer at which we want to measure content loss, we added a content-loss layer that simply computes the content loss (as described earlier) during a forward pass of the network. Finally, we trimmed off all layers (ReLU, pooling, etc.) after the last content-loss layer.
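A rough sketch of this construction, assuming the ContentLoss layer above and hypothetical names normalization, content_img, and content_layers (e.g. {2} for conv_2):

```python
import copy
import torch.nn as nn
import torchvision.models as models

# Load the pre-trained VGG-19 feature extractor (weights stay frozen).
cnn = copy.deepcopy(models.vgg19(pretrained=True).features).eval()

model = nn.Sequential(normalization)    # z-score normalization layer described above
content_losses = []
conv_idx = 0
for layer in cnn.children():
    if isinstance(layer, nn.ReLU):
        layer = nn.ReLU(inplace=False)  # in-place ReLU interferes with the loss layers
    model.add_module(str(len(model)), layer)
    if isinstance(layer, nn.Conv2d):
        conv_idx += 1
        if conv_idx in content_layers:
            # Record the content image's features at this depth as the fixed target.
            target = model(content_img).detach()
            loss_layer = ContentLoss(target)
            model.add_module(f"content_loss_{conv_idx}", loss_layer)
            content_losses.append(loss_layer)

# Trim off every layer after the last content-loss layer.
last = max(i for i, l in enumerate(model) if isinstance(l, ContentLoss))
model = model[: last + 1]
```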
Optimization
To perform the optimization, we used the LBFGS optimizer on the input image, passing it through the pre-trained VGG-19 net with the added normalization and style/content-loss layers. We run this optimization a number of times during a training run; each optimizer step evaluates a closure that goes through the following steps:
Clamping the input image pixel values between 0 and 1
Clearing the gradients by setting them to 0
Passing the input image through the model
Computing the weighted content loss and its gradient
Returning the loss
In addition to the steps mentioned above, we also print the (unweighted) content loss every 10 iterations and clamp the input image between 0 and 1 one last time at the end of the optimization. Additionally, the output image is saved as a PNG file every 100 iterations. A sketch of this optimization loop is shown below.
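Under the same assumptions (input_img, num_steps, content_weight, and content_losses are illustrative names), the loop looks roughly like this:

```python
import torch
import torch.optim as optim

# LBFGS optimizes the pixels of `input_img` directly; the network weights stay frozen.
input_img.requires_grad_(True)
optimizer = optim.LBFGS([input_img])

step = [0]
while step[0] < num_steps:

    def closure():
        # 1. clamp the input image pixel values between 0 and 1
        with torch.no_grad():
            input_img.clamp_(0, 1)
        # 2. clear the gradients
        optimizer.zero_grad()
        # 3. pass the input image through the model (this populates the loss layers)
        model(input_img)
        # 4. compute the weighted content loss and its gradient
        content_score = sum(cl.loss for cl in content_losses)
        loss = content_weight * content_score
        loss.backward()
        step[0] += 1
        if step[0] % 10 == 0:
            print(f"step {step[0]}: content loss {content_score.item():.6f}")
        # 5. return the loss
        return loss

    optimizer.step(closure)

# clamp one last time before saving the final output
with torch.no_grad():
    input_img.clamp_(0, 1)
```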
Experiments
As discussed earlier, content loss is applied at a specific layer of the network. Accordingly, we tried applying the content-loss function at each layer of the VGG-19 net model (5 in total). The results are shown below.
As seen above, the reconstructed image gets worse (noisier) when the content loss is applied at later layers of the network; this is because the feature maps at those layers are more abstracted from the original input image. The same trend shows up in the minimum content loss obtained for content reconstruction at each layer: 0.0, 0.000177, 0.109410, … (these values are for a 300-step optimization). Accordingly, we chose to apply content loss at the second layer (conv_2), since that is the deepest we can go without noticeable noise.
Here are a few more examples of content reconstruction for different content images, using two random-noise input images each optimized for 300 steps:
Original Image
Reconstructed Image #1
Reconstructed Image #2
As seen above, the reconstructed images are slightly blurry, though quite similar to the original image. This may be because content reconstruction was performed at layer 2, which is slightly abstracted away from the original image. Additionally, even though the two reconstructed images for each example started off as randomly-generated noise, they both look almost identical; this is because the minimum content loss obtained for either of these images was extremely small (around 0.0001). This shows that our content reconstruction operation is working as expected.
Texture Synthesis
This time around, we focused on the style-space loss; the steps were similar to those in the previous section. Below is the image used for the texture synthesis experiments; the image was (manually) resized to the same dimensions as the content image, since the dimensions need to match for the losses to be computed element-wise.
Defining Style loss
To compute the difference in style between two images, we take the Gram matrix of the feature maps at a given layer and then apply the mean-squared error (MSE) between the two Gram matrices, analogous to the content-loss calculation.
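A minimal sketch of the Gram matrix and the corresponding style-loss layer, mirroring the content-loss layer above (names are illustrative):

```python
import torch
import torch.nn as nn

def gram_matrix(feat):
    """Channel-by-channel inner products of a feature map, normalized by its size."""
    b, c, h, w = feat.size()
    flat = feat.view(b * c, h * w)
    return flat @ flat.t() / (b * c * h * w)

class StyleLoss(nn.Module):
    """Like ContentLoss, but compares Gram matrices instead of raw feature maps."""
    def __init__(self, target_feat):
        super().__init__()
        self.target = gram_matrix(target_feat).detach()
        self.criterion = nn.MSELoss()
        self.loss = torch.tensor(0.0)

    def forward(self, x):
        self.loss = self.criterion(gram_matrix(x), self.target)
        return x
```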
Adding Style-Loss Layers
Instead of adding content-loss layers to the pre-trained VGG-19 net model, we now add style-loss layers; everything else is exactly as it was in the previous section.
Optimization
Finally, we added the optimization step for the style-space loss. Rather than creating a separate optimization function, we simply added two flags, use_content and use_style, to select which losses are included in the objective; a sketch is shown below.
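Inside the closure, the loss computation then becomes something like the following (weights and list names are illustrative, not our exact code):

```python
# Combine the losses selected by the two flags before calling backward().
# Assumes at least one flag is set, so `loss` ends up as a tensor.
loss = 0.0
if use_content:
    loss = loss + content_weight * sum(cl.loss for cl in content_losses)
if use_style:
    loss = loss + style_weight * sum(sl.loss for sl in style_losses)
loss.backward()
```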
Experiments
Just like with content loss, we tried applying the style-loss function at each layer of the VGG-19 net model. Results are shown below.
As seen above, applying style-loss optimization to different layers of the VGG-19 net model results in different textures. Among these, the layer-2 output seems to be the smoothest (the layer-1 output is a close second); the other outputs show varying degrees of noise. Accordingly, we decided to apply style loss at layers 1 & 2. The result of this combined optimization is as follows:
Below are a few more examples of texture synthesis for different style images, using two random-noise input images each optimized for 300 steps (layers 1 & 2 are used, as before):
Original Image
Texture #1
Texture #2
The textures generated resemble the colors in the original input images. Since random noise was used as the input, the two texture images generated for each example are different, though the patterns are quite similar. Accordingly, the texture synthesis step is also working as expected.
Style Transfer
After having tested both content-space loss and style-space loss independently, we now put them both together to perform style transfer. The results are included in this section.
Hyperparameters Used
The following values were used for generating the results in this section:
Content-Loss Layers: Layer 2
Style-Loss Layers: Layer 1 & 2
Content Weight: 1
Style Weight: 1000000
Input Image: White Noise
Number of Optimization Steps: 1500 (sharper outputs may be obtained by training for longer)
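Gathered into a single configuration for reference (names are illustrative, not our exact code):

```python
# Illustrative summary of the hyperparameters listed above.
config = {
    "content_layers": [2],       # conv_2
    "style_layers": [1, 2],      # conv_1 and conv_2
    "content_weight": 1,
    "style_weight": 1_000_000,
    "input_image": "white_noise",
    "num_steps": 1500,
}
```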
Content vs Style Grid
Noise vs Content Image as Input
We also compared using the content image vs. white noise as the input image (up to this point, only white noise had been used). Below are the content and style images used.
For both input images, we performed the style transfer operation for 100 optimization steps. Below are the outputs.
The runtimes for style transfer using noise vs. the content image as input were 93.78 and 102.66 seconds, respectively (these runtimes are large even for 100 steps because a CPU was used). Using the content image as the input makes the optimization take slightly longer (though the two runtimes are still comparable), but the output is much better than when using noise as the input. The left image looks very noisy compared to the right since, at only 100 steps, the network has not yet recovered the content of the image; using the content image as the input skips this stage, making the output look smooth very early on in training.
More Results
(Note: White noise was used as input in these images.)