CMU 16-726: Learning Based Image Synthesis


Neural Style Transfer

Maneesh Bilalpur(mbilalpu)


Overview

In this assignment, we implement neural style transfer, which renders specific content in a given artistic style; for example, generating cat images in Ukiyo-e style. The algorithm takes a content image, a style image, and an input image. The input image is optimized to match the two target images in content and style distance space. In the first part of the assignment, we optimize a noisy (random) image to reconstruct a given content image. We then synthesize textures from a noise image, and finally combine the two objectives to transfer the given style onto the content image.

Content Loss

We optimize the content loss (MSE between the features of the real image and the noisy image) in the feature space of a pretrained computer vision model (here, VGG-19). We present the results of optimizing at different layers below.
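The content loss above can be sketched as follows. This is a minimal stand-in, assuming PyTorch; the random tensors take the place of real VGG-19 conv activations.

```python
import torch
import torch.nn.functional as F

def content_loss(input_feats, target_feats):
    # MSE between the feature maps of the synthesized image and the
    # content image at a chosen VGG-19 conv layer
    return F.mse_loss(input_feats, target_feats)

# toy tensors standing in for conv activations
target = torch.randn(1, 64, 32, 32)
zero_loss = content_loss(target, target)  # identical features give zero loss
```

In the full pipeline, the target features are extracted once from the content image and held fixed while the input image is optimized.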

[Figure] Input content image.



[Figure] Reconstructed images optimized at conv1, conv2, conv3, conv4, and conv1+conv4 layers respectively.

We observe no significant difference between optimizing at most of the layers; however, optimizing at conv2 leads to a poor reconstruction. We believe this is due to the initialization of the optimization problem, something we demonstrate again later in this report.




Comparing two noisy initializations

[Figure] Content reconstructions optimized at conv1+conv4 with two different noise initializations.

We observe no significant texture differences between the images that only differ by the initialization of the optimization problem.




Texture synthesis

We perform texture synthesis on a noisy image by optimizing the MSE between the Gram matrices of the noisy input and the style image.
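A minimal sketch of the Gram-matrix style loss, assuming PyTorch; the normalization by feature-map size matches the convention used later in this report, and the random tensor stands in for real VGG activations.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feats):
    # feats: (batch, channels, height, width) activations from a VGG conv layer;
    # the Gram matrix is normalized by the size of the feature maps
    b, c, h, w = feats.shape
    f = feats.view(b * c, h * w)
    return f @ f.t() / (b * c * h * w)

def style_loss(input_feats, style_feats):
    # MSE between the Gram matrices of the synthesized and style images
    return F.mse_loss(gram_matrix(input_feats), gram_matrix(style_feats))

g = gram_matrix(torch.randn(1, 8, 4, 4))  # an 8 x 8 symmetric matrix
```

Because the Gram matrix discards spatial layout and keeps only channel co-occurrence statistics, matching it reproduces texture without copying the style image's composition.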

[Figure] Picasso style input image.



[Figure] Texture-synthesized images optimized at conv1, conv2, conv3, conv4, conv5, and conv1+conv5 layers respectively.

We observe significant texture differences with respect to the choice of features used for optimization. In the early layers the strokes are very small, producing a speckled image; at the other extreme, the strokes are longer and carry more shape information.




Comparing two noisy initializations

[Figure] Texture-synthesized images optimized at conv1+conv5 with two different initializations.

We observe no major texture differences between the images, which differ only in the initialization of the optimization problem. However, some differences in brush strokes do exist: the left image has wider strokes than the right. This broadly suggests that initialization can visibly affect both style and content, and that the LBFGS optimization is sensitive to it. I also personally observed instances of "nan" losses for the same reason.
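One way to guard against such "nan" failures (my own workaround, not part of the assignment specification) is to re-clamp the optimized image into the valid pixel range after every LBFGS step, so the features it produces stay finite. A minimal sketch, with a stand-in loss in place of the real style loss:

```python
import torch

x = torch.rand(1, 3, 64, 64, requires_grad=True)  # noisy input image
optimizer = torch.optim.LBFGS([x])

def closure():
    optimizer.zero_grad()
    loss = (x ** 2).mean()  # stand-in for the actual style loss
    loss.backward()
    return loss

for _ in range(3):
    optimizer.step(closure)
    with torch.no_grad():
        x.clamp_(0.0, 1.0)  # keep pixels in [0, 1] between steps
```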




Style Transfer

Training

We combine the content and style losses from the problems above to perform style transfer, formulated as a weighted combination of the two losses. Note that the Gram matrices are normalized by the size of the feature maps. While tuning the hyperparameters, I observed that the best synthesis is reached in fewer iterations with the Picasso style image than with the Starry Night style image, whether the input is a noisy image or the content image. This suggests that certain styles are easier to synthesize than others.
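The combined objective and optimization loop can be sketched as follows. This is a minimal stand-in, assuming PyTorch's LBFGS, with random tensors in place of real VGG-19 features; the weights shown are illustrative, not the tuned values.

```python
import torch
import torch.nn.functional as F

def gram(f):
    # size-normalized Gram matrix, as in the texture synthesis part
    b, c, h, w = f.shape
    v = f.view(b * c, h * w)
    return v @ v.t() / (b * c * h * w)

# stand-ins for features extracted once from fixed VGG-19 layers;
# in the real pipeline these come from the content and style images
content_target = torch.randn(1, 16, 8, 8)
style_target = gram(torch.randn(1, 16, 8, 8))

x = content_target.clone().requires_grad_(True)  # content-image initialization
optimizer = torch.optim.LBFGS([x])
content_weight, style_weight = 1.0, 1e4          # illustrative, not the tuned values

def closure():
    optimizer.zero_grad()
    loss = (content_weight * F.mse_loss(x, content_target)
            + style_weight * F.mse_loss(gram(x), style_target))
    loss.backward()
    return loss

initial_loss = float(closure())
for _ in range(10):
    final_loss = optimizer.step(closure)
```

Since the style term operates on Gram matrices while the content term operates on raw features, the two weights balance texture fidelity against structural fidelity.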

Examples

[Figure] Left to right: style, content, and style-transferred images.



[Figure] Left to right: style, content, and style-transferred images.



Both of the above examples are initialized with the content image.

Random noise vs. Content image

When we use random noise instead of the content image as the input, we observe similar results. We compare the outputs from the two initializations below.

[Figure] Top, left to right: style and content images. Bottom, left to right: outputs with the content image and with a noise image as input, respectively.

Both approaches were run for 1000 steps, so the outputs are comparable, and they look very similar. However, the texture on smooth surfaces (see the railings at the front and sides of the building) is more faithful when starting from the content image than from noise. In other words, a noise input would need more iterations (and hence more runtime) to match the content-image input.




[Figure] Top, left to right: style and content images. Bottom, left to right: outputs with the content image and with a noise image as input, respectively.

In other instances, I observed a mode-failure-like situation with the noisy input compared to the content image (note that the dome and the left building structures are only faintly visible). For these reasons, I think the content image should be the preferred input whenever it is available.




Style Transfer on Favourite images

[Figure] Transforming a family photo into Picasso-style art. The texture term outweighs the content loss; a better output can be obtained by increasing the content loss contribution.



[Figure] Transforming a pleasant Pepsi into an evil Pepsi using a Coke style image. This could be improved with a better style image (the current one has a hand obscuring the background texture).



[Figure] Left to right: style, content, and style-transferred images.

Most styles in the provided examples are fairly restricted in their texture and color spectrum (the Picasso example uses only shades of yellow and blue-gray). In this example I try to convert a Coke can into a Pepsi bottle. The inputs differ in the nature of the style: indoor vs. outdoor, can vs. bottle. Interestingly, the model fails at style transfer despite training for about 1500 iterations.

Images not provided with the assignment are stock photos from Google.
Overview from the assignment website here.
Website template copied from here.