Maneesh Bilalpur (mbilalpu)
In this assignment, we implement neural style transfer, which renders specific content in a chosen artistic style; for example, generating cat images in the Ukiyo-e style. The algorithm takes in a content image, a style image, and an input image. The input image is optimized to match the two target images in content and style distance spaces. In the first part of the assignment we optimize a noisy (random) image to reconstruct a given content image. We then synthesize textures from a noise image, and finally combine the two to render a given content image in the given style.
We optimize for the content loss, the MSE between the content image and the noisy image computed in the feature space of a pretrained computer vision model (here, VGG-19). We present the results of optimizing at different layers below.
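As a minimal sketch (not the exact assignment code), the content loss can be written as an MSE over VGG-19 activations at a chosen layer. The helper names and the default layer index are assumptions for illustration:

```python
import torch.nn.functional as F
from torchvision import models

# Frozen VGG-19 feature extractor in eval mode.
vgg = models.vgg19(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def features_at(x, layer_idx):
    # Run the image through VGG-19 layers up to and including layer_idx.
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i == layer_idx:
            break
    return x

def content_loss(input_img, content_img, layer_idx=21):
    # layer_idx=21 corresponds to conv4_2 in torchvision's VGG-19;
    # the choice of layer is an assumption for illustration.
    return F.mse_loss(features_at(input_img, layer_idx),
                      features_at(content_img, layer_idx))
```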
We observe no significant difference between optimizing at most layers; however, optimizing at conv2 leads to a poor reconstruction. We believe this is due to the initialization of the optimization problem, something we demonstrate again later in this report.
We observe no significant differences between reconstructions that differ only in the initialization of the optimization problem.
We perform texture synthesis by optimizing a noisy image for the MSE between the Gram matrices of the noisy input and the style input.
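A sketch of this style loss, reusing the hypothetical features_at helper above; the set of layer indices is an assumed choice, and the Gram matrices are normalized by the feature-map size (as noted in the style-transfer part below):

```python
def gram_matrix(feat):
    # feat: (batch, channels, height, width) feature map.
    b, c, h, w = feat.shape
    f = feat.view(b * c, h * w)
    # Normalize by the feature-map size so the loss scale is
    # comparable across layers.
    return f @ f.t() / (b * c * h * w)

def style_loss(input_img, style_img, layers=(0, 5, 10, 19, 28)):
    # Layer indices are an assumed choice (conv1_1..conv5_1 in VGG-19).
    loss = 0.0
    for idx in layers:
        loss = loss + F.mse_loss(gram_matrix(features_at(input_img, idx)),
                                 gram_matrix(features_at(style_img, idx)))
    return loss
```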
We observe significant texture differences with respect to the choice of features used for optimization. In the early layers the strokes are very small, hence the speckled appearance; at the other extreme, the strokes are longer and carry more shape information.
We observe no significant texture differences between images that differ only in the initialization of the optimization problem. However, differences in brush strokes do exist: the left image has wider strokes than the right one. Broadly, the initialization differences suggest that variations in style and content can be observed and that the L-BFGS optimization is sensitive to initialization. I also personally observed instances of "nan" errors for the same reason.
We combine the content and style losses from the above problems to perform style transfer, formulated as a weighted combination of the two losses. Note that the Gram matrices are normalized by the size of the feature maps. While tuning the hyperparameters, I observed that the best synthesis with Picasso-style images requires fewer iterations than with Starry Night-style images, regardless of whether a noisy image or the content image is used as input. This suggests that certain styles are easier to synthesize than others.
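A minimal sketch of this weighted formulation with L-BFGS, assuming the content_loss and style_loss helpers above. The weights and step count are illustrative, and clamping the image inside the closure is one common way to avoid the "nan" failures mentioned earlier:

```python
import torch

def run_style_transfer(init_img, content_img, style_img,
                       content_weight=1.0, style_weight=1e6, num_steps=300):
    # init_img may be the content image or random noise; it is the
    # variable being optimized. Weights here are illustrative only.
    input_img = init_img.clone().requires_grad_(True)
    optimizer = torch.optim.LBFGS([input_img])

    step = [0]
    while step[0] < num_steps:
        def closure():
            # Clamping to a valid pixel range keeps L-BFGS stable.
            with torch.no_grad():
                input_img.clamp_(0, 1)
            optimizer.zero_grad()
            loss = (content_weight * content_loss(input_img, content_img)
                    + style_weight * style_loss(input_img, style_img))
            loss.backward()
            step[0] += 1
            return loss
        optimizer.step(closure)

    with torch.no_grad():
        input_img.clamp_(0, 1)
    return input_img.detach()
```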
Both of the above examples are initialized with the content image.
When we use random noise instead of the content image, we observe similar results. We compare the outputs from the noisy input and the content image below.
Both approaches were run for 1000 steps, so the outputs are comparable. The two outputs look very similar, but the texture on smooth surfaces (see the railings to the front and sides of the building) is more consistent with the content image than with the noise input. In other words, it takes more iterations (and hence more runtime) with a noise image as input than with the content image.
In other instances, I observed a mode-failure-like situation with the noisy input (note that the dome and the left building structures are only faintly visible) compared to the content image. For these reasons, I think the content image should be the preferred input whenever it is available.
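For reference, the two settings compared above differ only in how the optimized image is initialized; a sketch, reusing the hypothetical run_style_transfer above:

```python
# Initialize from the content image (converged faster in my runs) ...
output_from_content = run_style_transfer(content_img, content_img, style_img)
# ... or from random noise (needs more iterations for comparable quality).
noise = torch.rand_like(content_img)
output_from_noise = run_style_transfer(noise, content_img, style_img)
```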
Most styles in the provided examples are fairly restricted in their texture and color spectrum (Picasso used only shades of yellow and blue-gray). In this example I try to convert a Coke can into a Pepsi bottle. The inputs differ in the nature of the style: indoor vs. outdoor, can vs. bottle. Interestingly, the model fails at style transfer despite optimizing for about 1500 iterations.
Images not provided with the assignment are stock photos from Google.