16-726 Assignment 4: Neural Style Transfer

-Liting Wen (litingw)-

Part 0. Overview

This project implements neural style transfer, an algorithm that generates images blending the content of one image with the artistic style of another. The process optimizes the pixels of a third image so that it simultaneously matches the content of one target image and the style of another.

The assignment has 4 parts: (1) Content Reconstruction, (2) Texture Synthesis, (3) Style Transfer, and (4) Bells & Whistles (Extra Points).

Part 1: Content Reconstruction

* Optimize content loss at different layers:

Example Image

Content Image

Example Image

Conv_3

Example Image

Conv_5

Example Image

Conv_8

Example Image

Conv_11

My preferred choice is optimizing the content loss at 'conv_3', as it is less sensitive to minor variations in low-level inputs, resulting in a more stable optimization process and a consistently decreasing content loss. Moreover, compared to deeper layers, it better preserves fine details and yields superior reconstruction quality.
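A minimal sketch of the content loss being optimized here, assuming it is the mean squared error between feature maps at the chosen layer. The `ContentLoss` name and the random tensors standing in for 'conv_3' activations are illustrative and not taken from the assignment code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContentLoss(nn.Module):
    """Transparent layer: stores the MSE to a fixed target and returns its input unchanged."""

    def __init__(self, target: torch.Tensor):
        super().__init__()
        self.target = target.detach()  # fixed reference, excluded from the gradient
        self.loss = torch.tensor(0.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        self.loss = F.mse_loss(x, self.target)
        return x


torch.manual_seed(0)
# Random tensors stand in for feature maps of the content image and the current input.
target_feat = torch.rand(1, 8, 16, 16)
input_feat = torch.rand(1, 8, 16, 16)

layer = ContentLoss(target_feat)
passed_through = layer(input_feat)
loss_different = layer.loss.item()

layer(target_feat.clone())
loss_identical = layer.loss.item()
```

Because the layer returns its input untouched, it could be inserted after any convolution, which is one way to compare 'conv_3' against deeper layers without changing the rest of the network.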

* At 'conv_3', given different random noise:

At this convolution layer, the reconstruction remains stable, showing only subtle variations across different noise inputs.

Example Image

Content Image

Example Image

Noise 1

Example Image

Noise 2

Part 2: Texture Synthesis

* Optimize texture loss at different layers:

Layers 1–5 retain large color blocks in the image but fail to capture fine stylistic details. Layers 5–10 produce more appropriate texture details but lose a significant amount of low-level information. Using layers 1–10 balances both and performs well; adding more layers (up to 15) does not help much.

→ Final choice: layers 1–10.
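The layer comparison above can be sketched as a sum of Gram-matrix losses over the selected layers. This is a hedged illustration: the normalization by element count is one common convention and an assumption here, and the random tensors are placeholders for real network activations.

```python
import torch
import torch.nn.functional as F


def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Channel-by-channel correlations of a (B, C, H, W) feature map."""
    b, c, h, w = feat.shape
    flat = feat.reshape(b * c, h * w)
    # Dividing by the number of elements is an assumed normalization.
    return flat @ flat.t() / (b * c * h * w)


def style_loss(input_feats, target_feats):
    """Sum of Gram-matrix MSEs over the selected layers."""
    return sum(
        F.mse_loss(gram_matrix(x), gram_matrix(t))
        for x, t in zip(input_feats, target_feats)
    )


torch.manual_seed(0)
# Random tensors stand in for activations at three layers of increasing depth.
shapes = [(1, 4, 32, 32), (1, 8, 16, 16), (1, 16, 8, 8)]
texture_feats = [torch.rand(s) for s in shapes]
noise_feats = [torch.rand(s) * 0.5 for s in shapes]

loss_all_layers = style_loss(noise_feats, texture_feats).item()
loss_first_layer = style_loss(noise_feats[:1], texture_feats[:1]).item()
loss_same = style_loss(texture_feats, texture_feats).item()
```

Choosing a layer range such as 1–10 then amounts to choosing which feature maps are passed to `style_loss`.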

Example Image

Texture Image

Example Image

Layers 1–5

Example Image

Layers 5–10

Example Image

Layers 1–10

Example Image

Layers 1–15

* Given different random noise:

Under the same style but different random initializations, the local details and positions of the textures vary, while the overall distribution and stylistic appearance remain similar.
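One plausible reason for this behavior is that a Gram matrix sums over spatial positions, so it records which feature channels co-occur but not where. The sketch below checks this on a random tensor standing in for a feature map. In a real network the features also depend on local neighborhoods, so this is only part of the explanation.

```python
import torch


def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    b, c, h, w = feat.shape
    flat = feat.reshape(b * c, h * w)
    return flat @ flat.t() / (b * c * h * w)


torch.manual_seed(0)
feat = torch.rand(1, 4, 8, 8)  # placeholder feature map

# Shuffle the 64 spatial positions identically in every channel.
perm = torch.randperm(8 * 8)
shuffled = feat.reshape(1, 4, 64)[:, :, perm].reshape(1, 4, 8, 8)

layout_changed = not torch.equal(feat, shuffled)
gram_unchanged = torch.allclose(gram_matrix(feat), gram_matrix(shuffled), atol=1e-6)
```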

(1) Noise 1:

Example Image

Layers 1–5

Example Image

Layers 5–10

Example Image

Layers 1–10

Example Image

Layers 1–15

(2) Noise 2:

Example Image

Layers 1–5

Example Image

Layers 5–10

Example Image

Layers 1–10

Example Image

Layers 1–15

Part 3. Style Transfer

* Implementation Details:

Taking the content image as input, I use num_steps=300, style_weight=1e6, content_weight=1, and torch.manual_seed(92).

content_layers_default = ['conv_3']

style_layers_default = ['conv_1', 'conv_2', 'conv_3', 'conv_4', 'conv_5', 'conv_6', 'conv_7', 'conv_8', 'conv_9', 'conv_10']
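A rough sketch of how these settings could fit together in an optimization loop. Several parts are assumptions made for illustration: three frozen random conv layers stand in for the pretrained network, the images are random placeholders, and L-BFGS is an assumed optimizer choice since the report does not name one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(92)

# Hypothetical stand-in for the pretrained network: frozen, randomly initialized conv layers.
net = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
).eval()
for p in net.parameters():
    p.requires_grad_(False)


def conv_outputs(img):
    """Activations after each conv layer, keyed 'conv_1', 'conv_2', ..."""
    outs, x, i = {}, img, 0
    for module in net:
        x = module(x)
        if isinstance(module, nn.Conv2d):
            i += 1
            outs[f"conv_{i}"] = x
    return outs


def gram(feat):
    b, c, h, w = feat.shape
    flat = feat.reshape(b * c, h * w)
    return flat @ flat.t() / (b * c * h * w)


content_layers = ["conv_3"]
style_layers = ["conv_1", "conv_2", "conv_3"]  # shortened list for the toy network
content_weight, style_weight = 1.0, 1e6

content_img = torch.rand(1, 3, 32, 32)     # placeholder images
style_img = torch.rand(1, 3, 32, 32) ** 3

with torch.no_grad():
    c_out, s_out = conv_outputs(content_img), conv_outputs(style_img)
content_targets = {k: c_out[k] for k in content_layers}
style_targets = {k: gram(s_out[k]) for k in style_layers}

input_img = content_img.clone().requires_grad_(True)  # start from the content image
optimizer = torch.optim.LBFGS([input_img])            # assumed optimizer
history = []


def closure():
    with torch.no_grad():
        input_img.clamp_(0, 1)  # keep pixels in a valid range
    optimizer.zero_grad()
    outs = conv_outputs(input_img)
    c_loss = sum(F.mse_loss(outs[k], content_targets[k]) for k in content_layers)
    s_loss = sum(F.mse_loss(gram(outs[k]), style_targets[k]) for k in style_layers)
    loss = content_weight * c_loss + style_weight * s_loss
    loss.backward()
    history.append(loss.item())
    return loss


for _ in range(3):  # the report uses num_steps=300; kept short here
    optimizer.step(closure)
with torch.no_grad():
    input_img.clamp_(0, 1)
```

Changing `style_weight` while holding `content_weight` at 1 shifts the balance of the weighted sum, which is the comparison shown in the figures that follow.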

Example Image

style_weight=1e1, content_weight=1

Example Image

style_weight=1e3, content_weight=1

Example Image

style_weight=1e6, content_weight=1

Example Image

style_weight=1e9, content_weight=1

Results:

Example Image

Texture 1

Example Image

Texture 2

Example Image

Content 1

Example Image

1-1

Example Image

1-2

Example Image

Content 2

Example Image

2-1

Example Image

2-2

* Using random noise and the content image as input, respectively:

With the same hyper-parameters, using the content image as input produces a good result by step 300, showing both content and style clearly.

Example Image

Content image input. step=300

When the input is noise, only a small amount of content appears at step 300, but as the number of steps increases to 1000 or 3000, more content is gradually recovered. Around step 3000 the result improves, though it still does not match the quality obtained with the content image as input. It also takes about 10 times longer.

Example Image

Noise input. step=300

Example Image

Noise input. step=1000

Example Image

Noise input. step=3000

Example Image

Noise input. step=6000

Furthermore, for noise input, setting style_weight = 1e6 and content_weight = 100 yields decent results by step 1000, though they are still slightly worse than with content image input.

Example Image

Noise input. step=1000, style_weight = 1e6 and content_weight = 100
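The two starting points compared above can be sketched as follows. The helper name `init_input` and the uniform noise distribution are assumptions made for illustration. When the optimization starts from a copy of the content image, the content loss is exactly zero at step 0, which may partly explain why that setting needs far fewer steps.

```python
import torch


def init_input(content_img: torch.Tensor, use_noise: bool) -> torch.Tensor:
    """Return the tensor to optimize: uniform noise or a copy of the content image."""
    if use_noise:
        start = torch.rand_like(content_img)  # uniform in [0, 1); assumed distribution
    else:
        start = content_img.clone()           # copy, so the original image is not modified
    return start.requires_grad_(True)


torch.manual_seed(92)
content_img = torch.rand(1, 3, 64, 64)  # placeholder for a loaded content image

from_content = init_input(content_img, use_noise=False)
from_noise = init_input(content_img, use_noise=True)
```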

* Style transfer on some custom images:

Example Image

Content

Example Image

Texture

Example Image

Result

Example Image

Content

Example Image

Texture

Example Image

Result

Bells & Whistles (Extra Points)

1. Stylize grumpy cats: (2pts)

Example Image

Content

Example Image

Texture

Example Image

Result

Example Image

Content

Example Image

Texture

Example Image

Result

2. Apply style transfer to a video with temporal smoothness: (4pts)

Style transfer is applied to each video frame, with each frame's optimization initialized from a blend of the current content frame and the previous stylized frame. Additionally, a temporal consistency loss (MSE) is enforced to smooth transitions between frames.
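A minimal sketch of the two ingredients described above. The function names, the 0.5 blend ratio, and the choice to compare the current output against the previous stylized frame are all assumptions; the report does not specify them.

```python
import torch
import torch.nn.functional as F


def init_frame(content_frame, prev_stylized, blend=0.5):
    """Starting image for a frame: a blend of the frame and the previous stylized output."""
    if prev_stylized is None:  # the first frame has no predecessor
        return content_frame.clone()
    return blend * content_frame + (1.0 - blend) * prev_stylized.detach()


def temporal_loss(current, prev_stylized):
    """MSE between the image being optimized and the previous stylized frame."""
    return F.mse_loss(current, prev_stylized.detach())


# Toy frames: all ones for the current content, all zeros for the previous output.
frame = torch.ones(1, 3, 4, 4)
prev = torch.zeros(1, 3, 4, 4)

first_start = init_frame(frame, None)
next_start = init_frame(frame, prev)
loss_far = temporal_loss(frame, prev).item()
loss_same = temporal_loss(prev, prev).item()
```

In a full pipeline the temporal term would presumably be added to the content and style losses with its own weight, which is a further hyper-parameter not given in the report.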

Input
Output