This project focuses on implementing neural style transfer, an algorithm that generates images blending the content of one image with the artistic style of another. The process involves optimizing a third input image so that it simultaneously resembles the content of a target image and the style of another.
The assignment has four parts:
1. Content Reconstruction
2. Texture Synthesis
3. Style Transfer
4. Bells & Whistles (Extra Points)
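The optimization described above can be sketched as minimizing a weighted sum of a content term and a style term over the pixels of the input image, typically with L-BFGS. This is a minimal sketch rather than the assignment's exact code: `extract` (a function mapping an image to a dict of named layer activations) and `gram` are assumed helpers, and the hyperparameter names are illustrative.

```python
import torch
import torch.nn.functional as F

def gram(feat):
    # Normalized Gram matrix of a (B, C, H, W) feature map.
    b, c, h, w = feat.size()
    f = feat.view(b * c, h * w)
    return f @ f.t() / (b * c * h * w)

def stylize(input_img, content_feats, style_grams, extract, steps=300,
            content_weight=1.0, style_weight=1e6):
    """Optimize input_img so its activations match content_feats and its
    Gram matrices match style_grams. `extract` is a hypothetical helper
    returning a dict {layer_name: activation} for an image."""
    input_img = input_img.detach().clone().requires_grad_(True)
    optimizer = torch.optim.LBFGS([input_img])
    run = [0]
    while run[0] < steps:
        def closure():
            optimizer.zero_grad()
            feats = extract(input_img)
            c_loss = sum(F.mse_loss(feats[l], t) for l, t in content_feats.items())
            s_loss = sum(F.mse_loss(gram(feats[l]), g) for l, g in style_grams.items())
            loss = content_weight * c_loss + style_weight * s_loss
            loss.backward()
            run[0] += 1
            return loss
        optimizer.step(closure)
        with torch.no_grad():
            input_img.clamp_(0, 1)  # keep pixels in a valid range
    return input_img.detach()
```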
Content Image
Conv_3
Conv_5
Conv_8
Conv_11
My preferred choice is optimizing the content loss at 'conv_3': it is less sensitive to minor variations in low-level input, which makes the optimization more stable and the content loss decrease consistently. Moreover, compared to deeper layers, it better preserves fine details and yields superior reconstruction quality.
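The content loss at a layer like 'conv_3' is typically implemented as a transparent module inserted into the network: it records the MSE between the current activation and a fixed target, then passes the input through unchanged. A minimal sketch of such a layer (following the common PyTorch pattern, not necessarily my exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentLoss(nn.Module):
    """Transparent layer: records the MSE between the activation at this
    depth (e.g. after conv_3) and a fixed target feature map."""
    def __init__(self, target):
        super().__init__()
        self.target = target.detach()  # fixed content features, no gradient
        self.loss = torch.tensor(0.0)
    def forward(self, x):
        self.loss = F.mse_loss(x, self.target)
        return x  # pass-through so deeper layers still receive the activation
```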
At this convolution layer, the reconstruction remains stable, showing only subtle variations across different noise inputs.
Content Image
Noise 1
Noise 2
Layers 1–5 retain large color blocks in the image but fail to capture fine stylistic details. Layers 5–10 produce more appropriate texture detail but lose a significant amount of low-level information. Using layers 1–10 balances both and performs well; adding more layers (up to 15) does not help much.
→ Final choice: layers 1–10.
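The style loss at each of these layers compares Gram matrices, which capture channel-wise feature correlations (texture statistics) while discarding spatial layout. A minimal sketch of a style-loss layer placed after each of conv_1 through conv_10 (a common implementation pattern, hedged as an illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gram_matrix(feat):
    # Normalized Gram matrix: correlations between channel activations.
    b, c, h, w = feat.size()
    f = feat.view(b * c, h * w)
    return f @ f.t() / (b * c * h * w)

class StyleLoss(nn.Module):
    """Matches the Gram matrix of the current activation to that of the
    texture image at the same layer; transparent like ContentLoss."""
    def __init__(self, target_feat):
        super().__init__()
        self.target = gram_matrix(target_feat).detach()
        self.loss = torch.tensor(0.0)
    def forward(self, x):
        self.loss = F.mse_loss(gram_matrix(x), self.target)
        return x
```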
Texture Image
Layers 1–5
Layers 5–10
Layers 1–10
Layers 1–15
Under the same style but different random initializations, the local details and positions of the textures vary, while the overall distribution and stylistic appearance remain similar.
(1) Noise 1:
Layers 1–5
Layers 5–10
Layers 1–10
Layers 1–15
(2) Noise 2:
Layers 1–5
Layers 5–10
Layers 1–10
Layers 1–15
Taking the content image as input, I use: num_steps=300, style_weight=1e6, content_weight=1, and torch.manual_seed(92).
content_layers_default = ['conv_3']
style_layers_default = ['conv_1', 'conv_2', 'conv_3', 'conv_4', 'conv_5', 'conv_6', 'conv_7', 'conv_8', 'conv_9', 'conv_10']
style_weight=1e1, content_weight=1
style_weight=1e3, content_weight=1
style_weight=1e6, content_weight=1
style_weight=1e9, content_weight=1
Texture 1
Texture 2
Content 1
1-1
1-2
Content 2
2-1
2-2
With the same hyperparameters, using the content image as input produces a good result by step 300, showing both content and style clearly.
Content image input. step=300
When the input is noise, only a small amount of content appears by step 300, but as the number of steps increases to 1000 or 3000, more content is gradually recovered. Around step 3000 the result improves, though it still doesn't match the quality of using the content image as input, and it takes roughly 10 times longer.
Noise input. step=300
Noise input. step=1000
Noise input. step=3000
Noise input. step=6000
Furthermore, for noise input, setting style_weight = 1e6 and content_weight = 100 yields decent results by step 1000, though still slightly worse than with the content image as input.
Noise input. step=1000, style_weight = 1e6 and content_weight = 100
Content
Texture
Result
Content
Texture
Result
Content
Texture
Result
Content
Texture
Result
Apply style transfer to each video frame, initializing each frame with a blend of the current content frame and the previous stylized frame. Additionally, enforce a temporal consistency loss (MSE between consecutive stylized frames) to smooth transitions.
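The per-frame procedure above can be sketched as follows. This is a hypothetical outline, not my exact implementation: `stylize_frame(init, content, prev)` stands in for the single-image optimizer (with `temporal_loss(output, prev)` added to its objective when `prev` is available), and `blend` is an assumed mixing coefficient.

```python
import torch
import torch.nn.functional as F

def temporal_loss(current_stylized, prev_stylized, weight=1e2):
    # MSE penalty tying consecutive stylized frames together.
    return weight * F.mse_loss(current_stylized, prev_stylized)

def stylize_video(frames, stylize_frame, blend=0.5):
    """Stylize frames in order, seeding each optimization with a blend of
    the current content frame and the previous stylized output."""
    outputs = []
    prev = None
    for frame in frames:
        # First frame starts from its own content; later frames blend in
        # the previous result to reduce flicker.
        init = frame if prev is None else blend * frame + (1 - blend) * prev
        out = stylize_frame(init, frame, prev)
        outputs.append(out.detach())
        prev = outputs[-1]
    return outputs
```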