Project 5 - GAN-Based Image Editing
GANs are great at generating images from a distribution given a random latent variable z. However, we often want to control the output of a GAN for practical purposes; enter GAN inversion and GAN-based image editing. We pose GAN inversion as an optimization problem that minimizes the loss between a generated image and a target image over the latent variables of the GAN. For the vanilla GAN, this is the z vector; for StyleGAN, we can choose among z, w, and w+. Beyond image reconstruction, GAN inversion enables many downstream tasks, such as interpolating between two images and optimizing w.r.t. sketches.
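Concretely, for a generator G, a target image x, and a reconstruction loss L, the objective can be written as (with z replaced by w or w+ for StyleGAN):

```latex
z^{*} = \arg\min_{z} \; \mathcal{L}\big(G(z),\, x_{\text{target}}\big)
```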
GAN Inversion
We invert the GAN by optimizing a perceptual / Lp loss against a target image over the latent variable z, w, or w+. I implemented the optimization loop with the Adam optimizer at a learning rate of 1e-3 for 1000 steps, which I find to be a good balance between reconstruction quality and compute time. We show results for different GAN models, losses, and latent variables.
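A minimal sketch of the loop, assuming a pretrained generator G that maps latents to images; the VGG layer indices are illustrative stand-ins for the conv_1 .. conv_4 features used below, and inputs are assumed already normalized for VGG:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Frozen VGG-19 features for the perceptual loss. The indices below are
# illustrative; the project compares activations at conv_1 .. conv_4.
VGG = vgg19(weights="IMAGENET1K_V1").features.eval()
for p in VGG.parameters():
    p.requires_grad_(False)

def perceptual_loss(x, y, layers=(2, 7, 12, 21)):
    loss = 0.0
    for i, block in enumerate(VGG):
        x, y = block(x), block(y)
        if i in layers:
            loss = loss + F.l1_loss(x, y)
        if i == max(layers):
            break
    return loss

def invert(G, target, latent_dim=512, steps=1000, lr=1e-3):
    """Optimize a latent code z so that G(z) reconstructs the target image."""
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = perceptual_loss(G(z), target)
        loss.backward()
        opt.step()
    return z.detach()
```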
Figure. Vanilla GAN.
- We optimize the vanilla GAN over z. Vanilla GAN with z is not very expressive and is unable to reconstruct the target image well.
- For this experiment we used perceptual loss in layers conv_1, conv_2, conv_3, and conv_4.
- While the reconstruction is not very accurate, we do see shifts toward the target image, such as the scale of the cat's face.
[Image grid: target | init | result, three examples.]
Figure. StyleGAN loss ablation.
- We optimize StyleGAN over w+ under various losses: perceptual loss with different layer combinations (conv_123, conv_135, conv_1234), with an added L1 pixel term (conv_1234_l1), and with latent regularization (conv_1234_reg).
- We find that the combination of conv_1, conv_2, conv_3, and conv_4 achieves the best results with w+. Using only earlier conv layers does not lead to the best result; this may be because larger, more abstract features are needed to guide the GAN latent space.
- We find that adding L1 pixel loss to the perceptual loss blurs high-frequency details; notice the grass for the middle-column cat.
- We see that adding an L1 regularization term on the delta in latent space yields a less detailed reconstruction when the initialization is far from the target. A sketch of the combined objective follows the figure.
[Image grid: target | init | result for conv_123, conv_135, conv_1234, conv_1234_l1, and conv_1234_reg; three examples each.]
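A sketch of the combined objective behind the conv_1234_l1 and conv_1234_reg variants, reusing the perceptual_loss helper from the inversion sketch above; the weights here are placeholders, not the exact values used:

```python
def combined_loss(image, target, w, w_init, l1_weight=1.0, reg_weight=1.0):
    """Perceptual loss, plus pixel L1 (the *_l1 variant) and L1
    regularization on the latent delta (the *_reg variant)."""
    loss = perceptual_loss(image, target)
    loss = loss + l1_weight * F.l1_loss(image, target)
    loss = loss + reg_weight * (w - w_init).abs().mean()
    return loss
```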
Figure. Latent space ablation.
- We compare the latent variables available for StyleGAN optimization: z, w, and w+, all with conv_1234 perceptual loss.
- We see that optimizing w+ gives the best reconstruction. This makes sense because w+ is the most flexible, modulating each layer independently. w is good but gives blurrier results, which makes sense since it lacks per-layer modulation for high-frequency details. z gives a decent approximation but introduces more artifacts.
- My favorite space is w+ because it is the most expressive; a sketch of how each space is parameterized follows the figure.
[Image grid: target | init | result for w+, w, and z; three examples each.]
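A sketch of how each latent space is parameterized for optimization. The mapping/synthesis split and attribute names follow common StyleGAN implementations and are assumptions here:

```python
def init_latent(G, space, latent_dim=512, n_layers=14):
    """Return a leaf tensor to optimize for the chosen latent space."""
    z = torch.randn(1, latent_dim)
    if space == "z":
        # Optimize z directly; images come from the full generator G(z).
        return z.requires_grad_(True)
    w = G.mapping(z).detach()            # (1, latent_dim), assumed mapping net
    if space == "w":
        # A single w broadcast to every synthesis layer.
        return w.requires_grad_(True)
    # w+: an independent copy of w for each synthesis layer.
    w_plus = w.unsqueeze(1).repeat(1, n_layers, 1)  # (1, n_layers, latent_dim)
    return w_plus.requires_grad_(True)
```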
Interpolation
We can interpolate between two images by interpolating between the latent codes of their GAN reconstructions. This is super cool and is found in popular apps such as FaceApp.
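A minimal sketch of the interpolation, assuming latent_a and latent_b come from inverting the two endpoint images as above:

```python
def interpolate(G, latent_a, latent_b, n_frames=30):
    """Linearly interpolate between two latent codes and render each frame."""
    frames = []
    for t in torch.linspace(0.0, 1.0, n_frames):
        latent = (1 - t) * latent_a + t * latent_b
        with torch.no_grad():
            frames.append(G(latent))
    return frames  # e.g., save the frames out as a GIF
```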
Figure. Interpolation.
- We interpolate in vanilla GAN z space and in StyleGAN z, w, and w+ spaces.
- We see that z space gives a rough interpolation between two cats.
- We see that the w and w+ spaces are more expressive, giving better results. Notice how w+ captures the red cart in the final row, giving the best interpolation.
[GIFs: interpolations in vanilla_z, stylegan_z, stylegan_w, and stylegan_w+; three examples each.]
Scribble to Cat
We invert the cat GAN to minimize L1 and perceptual loss on sketches. Since the sketches are sparse, we apply a mask to both losses. Since the sketches are not in the domain of the cat images, I use a weight of 1e-5 on the perceptual loss and 1 on the pixel-wise L1 loss; this encourages the colors of the generated images to match the sketch. We initialize from the mean latent, which gives the best results by starting from an average cat.
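A sketch of the masked objective with the weights quoted above, reusing the perceptual_loss helper; treating nonzero pixels as the mask is an assumption about how the sketch mask is built:

```python
def scribble_loss(image, sketch, perc_weight=1e-5, l1_weight=1.0):
    """Masked pixel L1 plus a lightly weighted masked perceptual loss."""
    # 1 where the sketch has any color, 0 on the empty background.
    mask = (sketch.abs().sum(dim=1, keepdim=True) > 0).float()
    loss = l1_weight * F.l1_loss(image * mask, sketch * mask)
    loss = loss + perc_weight * perceptual_loss(image * mask, sketch * mask)
    return loss
```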
Figure. Scribble to cat. Vanilla.
- We perform sketch-to-cat in vanilla GAN z space; the StyleGAN z, w, and w+ results follow in the next figure.
- We see that vanilla z space is surprisingly good at sketch2cat. This may be because the more abstract z space is a good fit for sketches, which are themselves quite abstract.
- We see that larger facial patterns and rough colors are transferred from sketch to cat.
[Images: scribble2cat results in vanilla_z; six examples.]
Figure. Scribble to cat. StyleGAN.
- We perform sketch-to-cat in StyleGAN z, w, and w+ spaces.
- We see that StyleGAN z space is not a good space to optimize. A lot of artifacts are introduced when optimizing from mean z.
- We see that StyleGAN w space is a lot better at sketch2cat. This may be because w space is more expressive and allows for easier optimization. We see rough facial structures of the cats being transferred. We also see that color is generally transferred but finer details are missed.
- We see that the w+ space is the most expressive and transfers a lot from the sketches to the cat images. However, for some scenes, when the sketch colors are uncommon and not found in the distribution (see the cat with the pink ring), artifacts are produced, resulting in a blue ring.
- We see that rough color patterns work better than fine, sparse sketches for sketch2image. Notice how darker regions in the sketches usually result in darker regions in the output.
[Images: scribble2cat results in stylegan_z, stylegan_w, and stylegan_w+; six examples each.]
Figure. Scribble to cat. Vanilla GAN. Custom sketches.
- We perform sketch-to-cat with custom sketches.
- We see that vanilla GAN z space is not very expressive. The details from my harder custom sketches are not transferred well. Furthermore, no grass is generated from the green patches.
[Images: custom-sketch scribble2cat results in vanilla_z; five examples.]
Figure. Scribble to cat. StyleGAN. Custom sketches.
- We perform sketch-to-cat with custom sketches in StyleGAN w+ space.
- We see that StyleGAN w+ is very expressive: the generated outputs match the sketches well under the L1 pixel loss. Notice how green sketches leave the whole image slightly tinted green. This may be because it is hard to find a region of w+ space containing cats with grass, and the optimization got stuck in a local minimum.
- Notice the custom sketch where the cat is on the right side of the image. The generated image matches well color-wise, but the facial features of the cat are blurred. This may be due to the w+ optimization overfitting to the pixel loss and deviating too far from the cat distribution.
- We see that the best results have a lot of empty space where the loss is not computed (bottom left). Large patches of color do not work well as input to StyleGAN w+ optimization because it tends to overfit to the patches.
[Images: custom-sketch scribble2cat results in stylegan_w+; five examples.]