In this assignment, we implemented a few different techniques that manipulate images on the manifold of natural images. First, we invert a pre-trained generator to find a latent variable that closely reconstructs a given real image. In the second part of the assignment, we take a hand-drawn sketch and generate an image that matches it.
For all of the different settings, we use the same target image.
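The inversion step can be sketched as gradient descent on the latent variable. This is a minimal sketch, not our actual implementation: it assumes a toy linear generator G(z) = A z in place of the pre-trained GAN (so the gradient can be written by hand) and a plain squared pixel loss. The function `invert` and its parameters are hypothetical names.

```python
import numpy as np


def invert(A, target, steps=1000, lr=0.1, seed=0):
    """Recover a latent z so that the toy generator G(z) = A @ z matches target.

    Assumption: A is a fixed linear map standing in for a pre-trained GAN;
    a real setup would backpropagate through the network with autograd.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(A.shape[1])  # random initial latent
    for _ in range(steps):
        residual = A @ z - target        # G(z) - x
        grad = 2.0 * A.T @ residual      # gradient of ||G(z) - x||^2 w.r.t. z
        z = z - lr * grad                # plain gradient-descent step
    loss = float(np.sum((A @ z - target) ** 2))
    return z, loss


# Usage: the target is generated from a known latent, so inversion should find it.
rng = np.random.default_rng(1)
A = rng.standard_normal((64, 4)) / 8.0
z_true = rng.standard_normal(4)
z_hat, final_loss = invert(A, A @ z_true)
print(final_loss)
```

With a real generator the loss surface is non-convex, so the starting latent matters much more than in this linear toy.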
It looks like the perceptual loss helped prune out some of the artifacts, making the image more realistic.
The results look similar, although StyleGAN looks a little better, probably because it is a more capable model.
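To illustrate how a perceptual term can discourage artifacts, here is a sketch of a combined objective: an L1 pixel term plus a weighted feature-space term. The weighting scheme, the names `reconstruction_loss`, `perc_weight`, and `l1_weight`, and the feature extractor are all assumptions. In particular, `toy_features` (finite differences) is a hypothetical stand-in for the deep network features a real perceptual loss would use.

```python
import numpy as np


def toy_features(img):
    """Hypothetical stand-in for deep (e.g. VGG) features: finite differences
    along both axes, which respond to edges and textures, not absolute brightness."""
    dx = np.diff(img, axis=1)
    dy = np.diff(img, axis=0)
    return np.concatenate([dx.ravel(), dy.ravel()])


def reconstruction_loss(pred, target, perc_weight=0.5, l1_weight=1.0):
    """Weighted sum of an L1 pixel term and a feature-space (perceptual) term."""
    pixel = np.mean(np.abs(pred - target))
    perceptual = np.mean((toy_features(pred) - toy_features(target)) ** 2)
    return float(l1_weight * pixel + perc_weight * perceptual)


target = np.zeros((4, 4))
shifted = target + 0.05                                         # uniform offset, no new edges
checker = target + 0.1 * (np.indices((4, 4)).sum(axis=0) % 2)  # high-frequency artifact
print(reconstruction_loss(shifted, target), reconstruction_loss(checker, target))
```

Both candidates have the same L1 error (0.05), but only the checkerboard pays the feature-space penalty. This means the optimizer is pushed toward reconstructions without that kind of high-frequency artifact.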
I believe using w+ space gives the model more freedom to manipulate the output, and therefore the corresponding image resembles the target most closely among all the results. In summary, I believe stylegan_w+_perceptual=0.5 gives us the best image, but its runtime is also the longest.
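The extra freedom of w+ can be made concrete by counting optimizable entries. In w space, a single vector is shared by every layer of the synthesis network. In w+ space, each layer gets its own copy, which can drift independently. This is a minimal sketch: the layer count and latent width below are assumptions (the layer count depends on output resolution), and the helper names are hypothetical.

```python
import numpy as np

NUM_LAYERS = 14  # assumption: number of per-layer style inputs (resolution dependent)
W_DIM = 512      # assumption: dimensionality of the intermediate latent w


def w_to_wplus(w, num_layers=NUM_LAYERS):
    """w space: one vector shared by every layer, i.e. w broadcast to all rows."""
    return np.tile(w, (num_layers, 1))


def num_free_parameters(space, num_layers=NUM_LAYERS, w_dim=W_DIM):
    """Count the optimizable latent entries in each parameterization."""
    if space == "w":
        return w_dim
    if space == "w+":
        return num_layers * w_dim
    raise ValueError(f"unknown latent space: {space}")


w = np.random.default_rng(0).standard_normal(W_DIM)
wplus = w_to_wplus(w)   # one possible initialization for w+ optimization
wplus[3] += 0.5         # in w+, a single layer's style can change on its own
print(wplus.shape, num_free_parameters("w"), num_free_parameters("w+"))
```

Under these assumed sizes, w+ has 14 times as many free entries as w. That is consistent with w+ fitting the target more closely, and with it being able to leave the region of realistic images.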
We followed the instructions, ran our algorithm, and show two sets of results below:
I think the interpolation looks pretty good. Nothing I can really complain about.
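Interpolation between two images amounts to inverting both, blending their latents, and decoding each blend. The sketch below assumes plain linear interpolation; spherical interpolation is another common choice. The function name and the placeholder latents are hypothetical.

```python
import numpy as np


def interpolate_latents(z0, z1, num_frames):
    """Linearly interpolate between two latents; works for z, w, or w+ arrays.

    Each slice along axis 0 is one frame's latent, to be decoded by the generator.
    """
    ts = np.linspace(0.0, 1.0, num_frames)
    return np.stack([(1.0 - t) * z0 + t * z1 for t in ts])


z_a = np.zeros((14, 512))  # hypothetical inverted w+ latent of image A
z_b = np.ones((14, 512))   # hypothetical inverted w+ latent of image B
frames = interpolate_latents(z_a, z_b, num_frames=5)
print(frames.shape)
```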
First, I tested with my own drawing and got the following result (data, mask, z, w+):
Since my own drawing is so bad, I decided to test with the sketch from the project page.
It's actually quite interesting that the w+ space reconstruction completely failed to generate a realistic cat. I think this shows the tradeoff of operating in w+ space: we enjoy the benefit of having more control over the image, but at the same time we sacrifice realism. Also, since w+ actually worked well on my first drawing, I think this suggests that larger scribbles help when using w+.
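The sketch-to-image step constrains the output only where the user drew, which is what the mask is for. This is a minimal sketch under stated assumptions: a binary mask, a squared-error loss restricted to the masked pixels, and the same toy linear generator A z standing in for the pre-trained GAN. `masked_loss` and `fit_sketch` are hypothetical names.

```python
import numpy as np


def masked_loss(pred, sketch, mask):
    """Squared error counted only where the binary mask marks user strokes."""
    return float(np.sum((mask * (pred - sketch)) ** 2))


def fit_sketch(A, sketch, mask, steps=2000, lr=0.1, seed=0):
    """Optimize z so the toy generator A @ z matches the sketch inside the mask.

    Assumption: A is a fixed linear map standing in for the pre-trained generator.
    Pixels outside the mask contribute nothing, so the generator is free there.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(A.shape[1])
    for _ in range(steps):
        residual = mask * (A @ z - sketch)  # zero wherever mask == 0
        z = z - lr * 2.0 * A.T @ residual   # gradient of masked_loss (mask is 0/1)
    return z


rng = np.random.default_rng(2)
A = rng.standard_normal((64, 4)) / 8.0
z_true = rng.standard_normal(4)
mask = np.zeros(64)
mask[:32] = 1.0          # the "strokes" cover the first half of the pixels
sketch = A @ z_true
sketch[32:] = 100.0      # values outside the strokes are ignored
z_hat = fit_sketch(A, sketch, mask)
print(masked_loss(A @ z_hat, sketch, mask))
```

The fewer pixels the strokes cover, the less of the latent this loss pins down on its own. That is one plausible reason a higher-dimensional space like w+ can match sparse scribbles while drifting away from realistic images.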
We ran our projection algorithm on a 256×256 Obama image using the weights provided in the StyleGAN paper, and obtained the following results:
Target:
Projection with w:
Projection with w+:
The w projection looks realistic, but the facial expression doesn't resemble the original picture. The w+ projection has (perhaps?) better details, but it is not realistic at all. I think this again demonstrates the trade-off between using w+ and w, and indicates that the perceptual loss is not enough when we are trying to reconstruct facial expressions.