Maneesh Bilalpur (mbilalpu)
In this assignment, we implement neural style transfer, which renders specific content in a chosen artistic style; for example, generating cat images in the Ukiyo-e style. The algorithm takes in a content image, a style image, and an input image. The input image is optimized to match the two target images in content and style distance spaces. In the first part of the assignment we optimize a noisy (random) image to reconstruct a given content image. We then synthesize textures from a noise image, and finally combine the two to render a given content image in the given style.
We optimize for the content loss, the MSE between the content image and the noisy image computed in the feature space of a pretrained computer vision model (here, VGG-19). We present the results of optimizing at different layers below.
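As a minimal sketch (not the exact assignment code), the content loss can be written as an MSE over VGG-19 activations at a chosen layer. The helper names and the default layer index are assumptions for illustration:

```python
import torch.nn.functional as F
from torchvision import models

# Frozen VGG-19 feature extractor in eval mode.
vgg = models.vgg19(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def features_at(x, layer_idx):
    # Run the image through VGG-19 layers up to and including layer_idx.
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i == layer_idx:
            break
    return x

def content_loss(input_img, content_img, layer_idx=21):
    # layer_idx=21 corresponds to conv4_2 in torchvision's VGG-19;
    # the choice of layer is an assumption for illustration.
    return F.mse_loss(features_at(input_img, layer_idx),
                      features_at(content_img, layer_idx))
```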
We observe no significant difference between optimizing at most layers; however, optimizing at conv2 leads to a poor reconstruction. We believe this is due to the initialization of the optimization problem, something we demonstrate again later in this report.
We observe no significant differences between reconstructions that differ only in the initialization of the optimization problem.
We perform texture synthesis by optimizing a noisy image for the MSE between the Gram matrices of the noisy input and the style input.
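A sketch of this style loss, reusing the hypothetical features_at helper above; the set of layer indices is an assumed choice, and the Gram matrices are normalized by the feature-map size (as noted in the style-transfer part below):

```python
def gram_matrix(feat):
    # feat: (batch, channels, height, width) feature map.
    b, c, h, w = feat.shape
    f = feat.view(b * c, h * w)
    # Normalize by the feature-map size so the loss scale is
    # comparable across layers.
    return f @ f.t() / (b * c * h * w)

def style_loss(input_img, style_img, layers=(0, 5, 10, 19, 28)):
    # Layer indices are an assumed choice (conv1_1..conv5_1 in VGG-19).
    loss = 0.0
    for idx in layers:
        loss = loss + F.mse_loss(gram_matrix(features_at(input_img, idx)),
                                 gram_matrix(features_at(style_img, idx)))
    return loss
```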
We observe significant texture differences with respect to the choice of features used for optimization. In the early layers the strokes are very small, hence the speckled appearance; at the other extreme, the strokes are longer and carry more shape information.
We observe no significant texture differences between images that differ only in the initialization of the optimization problem. However, differences in brush strokes do exist: the left image has wider strokes than the right one. Broadly, the initialization differences suggest that variations in style and content can be observed and that the L-BFGS optimization is sensitive to initialization. I also personally observed instances of "nan" errors for the same reason.
We combine the content and style losses from the above problems to perform style transfer, formulated as a weighted combination of the two losses. Note that the Gram matrices are normalized by the size of the feature maps. While tuning the hyperparameters, I observed that the best synthesis with Picasso-style images requires fewer iterations than with Starry Night-style images, regardless of whether a noisy image or the content image is used as input. This suggests that certain styles are easier to synthesize than others.
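A minimal sketch of this weighted formulation with L-BFGS, assuming the content_loss and style_loss helpers above. The weights and step count are illustrative, and clamping the image inside the closure is one common way to avoid the "nan" failures mentioned earlier:

```python
import torch

def run_style_transfer(init_img, content_img, style_img,
                       content_weight=1.0, style_weight=1e6, num_steps=300):
    # init_img may be the content image or random noise; it is the
    # variable being optimized. Weights here are illustrative only.
    input_img = init_img.clone().requires_grad_(True)
    optimizer = torch.optim.LBFGS([input_img])

    step = [0]
    while step[0] < num_steps:
        def closure():
            # Clamping to a valid pixel range keeps L-BFGS stable.
            with torch.no_grad():
                input_img.clamp_(0, 1)
            optimizer.zero_grad()
            loss = (content_weight * content_loss(input_img, content_img)
                    + style_weight * style_loss(input_img, style_img))
            loss.backward()
            step[0] += 1
            return loss
        optimizer.step(closure)

    with torch.no_grad():
        input_img.clamp_(0, 1)
    return input_img.detach()
```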
Both of the above examples are initialized with the content image.
When we use random noise instead of the content image, we observe similar results. We compare the outputs from the noisy input and the content image below.
Both approaches were run for 1000 steps, so the outputs are comparable. The two outputs look very similar, but the texture on smooth surfaces (see the railings to the front and sides of the building) is more consistent with the content image than with the noise input. In other words, it takes more iterations (and hence more runtime) with a noise image as input than with the content image.
In other instances, I observed a mode-failure-like situation with the noisy input (note that the dome and the left building structures are only faintly visible) compared to the content image. For these reasons, I think the content image should be the preferred input whenever it is available.
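For reference, the two settings compared above differ only in how the optimized image is initialized; a sketch, reusing the hypothetical run_style_transfer above:

```python
# Initialize from the content image (converged faster in my runs) ...
output_from_content = run_style_transfer(content_img, content_img, style_img)
# ... or from random noise (needs more iterations for comparable quality).
noise = torch.rand_like(content_img)
output_from_noise = run_style_transfer(noise, content_img, style_img)
```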
Most styles in the provided examples are fairly restricted in their texture and color spectrum (Picasso used only shades of yellow and blue-gray). In this example I try to convert a Coke can into a Pepsi bottle. The inputs differ in the nature of the style: indoor vs. outdoor, can vs. bottle. Interestingly, the model fails at style transfer despite optimizing for about 1500 iterations.
Images not provided with the assignment are stock photos from Google.