16-726 Learning-Based-Image-Synthesis

Yiwen Zhao's Project Page

Assignment #5 - Cat Photo Editing

Motivation

Edit the style of an input image using a pretrained model.

Part 1

In the StyleGAN paper, the style is sampled from the latent space and injected into the synthesis network through AdaIN, where each feature map xi is normalized separately, then scaled and biased using the corresponding ys,i and yb,i.
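As a rough sketch (not the actual StyleGAN code), the AdaIN operation could look like the following; the tensor shapes are my own assumptions:

```python
import torch

def adain(x, y_s, y_b, eps=1e-8):
    """Adaptive instance norm: normalize each feature map, then scale/bias with the style."""
    # x: (N, C, H, W) feature maps; y_s, y_b: (N, C) per-channel style scale and bias
    mu = x.mean(dim=(2, 3), keepdim=True)            # per-feature-map mean
    sigma = x.std(dim=(2, 3), keepdim=True) + eps    # per-feature-map std
    x_norm = (x - mu) / sigma                        # normalize each feature map separately
    return y_s[:, :, None, None] * x_norm + y_b[:, :, None, None]
```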

There are three choices of latent space: z (the input noise vector), w (the output of the mapping network), and w+ (a separate w for each layer of the synthesis network).

We first initialize the latent, pass it through a pretrained generator, and then use an optimization-based method to pull the generated image toward the target while keeping it faithful to the styles the pretrained model can produce. A minimal sketch of this loop is shown below.
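This is only an illustrative sketch, assuming placeholder names (G, perc, target) rather than the actual assignment code; the loss weights follow the captions below:

```python
import torch
import torch.nn.functional as F

# Placeholders only: G stands in for the pretrained (frozen) generator and perc for a
# VGG-style perceptual loss; both are assumptions, not the real assignment code.
G = torch.nn.Sequential(torch.nn.Linear(512, 3 * 64 * 64), torch.nn.Unflatten(1, (3, 64, 64)))
for p in G.parameters():
    p.requires_grad_(False)                          # only the latent is optimized
perc = lambda a, b: F.mse_loss(a, b)                 # stand-in for the perceptual loss
target = torch.rand(1, 3, 64, 64)                    # the target (content) image

latent = torch.randn(1, 512, requires_grad=True)     # initialize the latent (here: z)
optimizer = torch.optim.Adam([latent], lr=0.01)

for step in range(1000):
    optimizer.zero_grad()
    fake = G(latent)                                  # generate from the current latent
    loss = 0.01 * F.l1_loss(fake, target) + 0.1 * perc(fake, target)
    loss.backward()
    optimizer.step()
```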

The choice of latent space, loss weights, optimizer, and data sample leads to results of different quality.

LBFGS converges faster than the Adam optimizer: in some cases the sample only changes quickly in the first several steps and then stays roughly the same afterwards.
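For reference, PyTorch's LBFGS re-evaluates the objective several times per step, so it takes a closure; a sketch reusing the placeholder names from the loop above:

```python
optimizer = torch.optim.LBFGS([latent], max_iter=20)

def closure():
    optimizer.zero_grad()
    fake = G(latent)
    loss = 0.01 * F.l1_loss(fake, target) + 0.1 * perc(fake, target)
    loss.backward()
    return loss

for step in range(50):
    optimizer.step(closure)                           # LBFGS calls closure() internally
```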

img part1_LBFGS_0_stylegan_w_0.1_1 img part1_LBFGS_0_stylegan_w_0.1_2 img part1_LBFGS_0_stylegan_w_0.1_3 img part1_LBFGS_0_stylegan_w_0.1_4 img part1_LBFGS_0_stylegan_w_0.1_5 img part1_LBFGS_0_stylegan_w_0.1_995

w latent, loss weight: perc 0.1 | l1 0.01 | delta 0.1, LBFGS, step 1, 2, 3, 4, 5, 995.

img part1_LBFGS_0_stylegan_w_0.1_1 img part1_LBFGS_0_stylegan_w_0.1_2 img part1_LBFGS_0_stylegan_w_0.1_3 img part1_LBFGS_0_stylegan_w_0.1_4 img part1_LBFGS_0_stylegan_w_0.1_5 img part1_LBFGS_0_stylegan_w_0.1_995

w latent, loss weight: perc 0.1 | l1 0.01 | delta 0.1, Adam, step 1, 201, 401, 601, 801, 995.

The results from the z/w/w+ latent spaces can all be faithful to the target (content) image to some extent while staying realistic. In practice, it is harder to find a good set of loss weights in w+. Below are the results using the best parameters found among all trials.


img data

Reference

img escher_sphere img escher_sphere

Left: stylegan, z | Right: vanillagan, z

img escher_sphere img escher_sphere

Left: stylegan, w | Right: vanillagan, w+

Part 2

In Part 2, we try to add content and style to a sketch image while preserving the contour information indicated by the user-given sketch.

The sketch image (RGBA) serves as the input, and a corresponding binary mask (0/1) is generated from its alpha channel.
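For illustration, extracting the mask from the alpha channel could look like this (the tensor name and shape are assumptions):

```python
import torch

sketch_rgba = torch.rand(1, 4, 64, 64)           # hypothetical RGBA sketch in [0, 1]
rgb = sketch_rgba[:, :3]                          # the drawn colors
mask = (sketch_rgba[:, 3:4] > 0).float()          # 1 where the user drew, 0 elsewhere
```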

We still use two loss terms here: a masked L1 (pixel) loss and a masked perceptual loss; a sketch of both follows.
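This is a minimal sketch of the two masked terms, assuming the rgb and mask tensors from the snippet above and a placeholder features network for the perceptual term:

```python
import torch.nn.functional as F

def masked_l1(fake, rgb, mask):
    # pixel loss only where the sketch is defined
    return F.l1_loss(fake * mask, rgb * mask)

def masked_perc(fake, rgb, mask, features):
    # perceptual loss computed on masked images (see the caveat below about
    # applying the mask directly in feature space)
    return F.mse_loss(features(fake * mask), features(rgb * mask))
```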

The results using latents from different spaces do not vary much under a coarse sketch, but differ noticeably with a dense sketch.

img coarse_data img coarse_stylegan img styleL_lr001_conv3_1 img styleL_lr001_conv4_1

sketch | latent z | latent w | latent w+

img coarse_data img coarse_stylegan img styleL_lr001_conv3_1 img styleL_lr001_conv4_1

sketch | latent z | latent w | latent w+

In the code implementation, both losses support a mask. However, applying the mask in feature space (for the perceptual loss) seems questionable: after the convolutional feature extraction, the features belonging to the sketch may no longer sit at the same spatial locations. With denser strokes, this effect might be alleviated.

Part 3

In this part, we need to add noise to the encoded input sketch (the forward process x_{t-1} → x_t), then denoise from the noisy latent using the parameters stored in the pretrained model, which provides the mean and variance of each reverse step x_t → x_{t-1}, combined with conditional and unconditional guidance. I use a classifier-free guidance (cfg) scale of 7, as sketched below.
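A hedged sketch of this procedure; unet, betas, alphas_cumprod, and the conditioning tensors are placeholders, not the pretrained model's actual interface:

```python
import torch

def sdedit(x0, unet, cond, uncond, betas, alphas_cumprod, N=500, cfg=7.0):
    # forward process: jump straight to step N by adding the scheduled amount of noise
    a_N = alphas_cumprod[N]
    x = a_N.sqrt() * x0 + (1 - a_N).sqrt() * torch.randn_like(x0)

    # reverse process: denoise from step N back to 0
    for t in reversed(range(N)):
        eps_c = unet(x, t, cond)                      # conditional noise prediction
        eps_u = unet(x, t, uncond)                    # unconditional noise prediction
        eps = eps_u + cfg * (eps_c - eps_u)           # classifier-free guidance

        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        beta_t = betas[t]

        x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()            # predicted clean image
        mean = (a_prev.sqrt() * beta_t / (1 - a_t)) * x0_hat \
             + ((1 - beta_t).sqrt() * (1 - a_prev) / (1 - a_t)) * x   # posterior mean
        var = beta_t * (1 - a_prev) / (1 - a_t)                       # posterior variance
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + var.sqrt() * noise
    return x
```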

The number of sampling steps N is recommended to be 500-700, because the DDPM pretraining uses 1000 steps between the encoded real image and pure noise. An N smaller than 1000 therefore preserves the low-level features of the sketch, which keeps its content.

img styleL_lr001_conv1_1 img styleL_lr001_conv2_1

sketch1 | sketch2

img styleL_lr001_conv3_1 img styleL_lr001_conv4_1 img styleL_lr001_conv5_1 img styleL_lr001_conv5_1

sketch2, cfg=7, step=500 | sketch1, cfg=7, step=500 | sketch1, cfg=7, step=700 | sketch1, cfg=6, step=500

(My drawing of Grumpy Cat doesn't look grumpy emm)