In this assignment, we implement several techniques for manipulating images on the manifold of natural images. First, we invert a pre-trained generator to find a latent variable that closely reconstructs a given real image. In the second part of the assignment, we take a hand-drawn sketch and generate an image that fits the sketch.
We solve an optimization problem to reconstruct the image from a particular latent code. Natural images lie on a low-dimensional manifold, and we treat the output manifold of a trained generator as a close approximation to it. This yields the following nonconvex optimization problem: for some choice of loss $\mathcal{L}$, a trained generator $G$, and a given real image $x$, $$z^{*}=\arg \min _{z} \mathcal{L}(G(z), x).$$ For the loss function, we use the $L_2$ loss together with combinations of perceptual (content) losses computed from VGG16 features.
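The optimization above can be sketched as a simple gradient-based loop. This is a minimal illustration, not the exact implementation: the toy `invert` function, its parameter names, and the pluggable `feat_extractor` argument are all assumptions. In practice the feature extractor would be a truncated pre-trained VGG16 (e.g. `torchvision.models.vgg16(...).features` up to an intermediate ReLU), and the generator would be the pre-trained GAN.

```python
import torch
import torch.nn.functional as F

def invert(generator, target, feat_extractor, perc_weight=0.5, steps=1000, lr=0.05):
    """Optimize a latent code z so that generator(z) reconstructs `target`.

    Hypothetical sketch: `generator` exposes a `z_dim` attribute, and
    `feat_extractor` maps images to perceptual (content) features.
    """
    z = torch.randn(1, generator.z_dim, requires_grad=True)  # latent code to optimize
    opt = torch.optim.Adam([z], lr=lr)  # Adam here; LBFGS is another common choice

    with torch.no_grad():
        target_feat = feat_extractor(target)  # features are fixed during optimization

    for _ in range(steps):
        opt.zero_grad()
        x = generator(z)
        l2 = F.mse_loss(x, target)                      # pixel-level reconstruction
        perc = F.mse_loss(feat_extractor(x), target_feat)  # content / perceptual term
        loss = (1 - perc_weight) * l2 + perc_weight * perc
        loss.backward()
        opt.step()
    return z.detach()
```

Sweeping `perc_weight` over {0.1, 0.3, 0.5, 0.7, 0.9} reproduces the column structure of the results table below.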
Reconstructions at iteration 0 and iteration 1000 for each model and latent space; columns give the perceptual loss weight.

| Model, latent space type | Iteration | 10% | 30% | 50% | 70% | 90% |
| --- | --- | --- | --- | --- | --- | --- |
| Vanilla GAN, Z space | iter 0 | | | | | |
| Vanilla GAN, Z space | iter 1000 | | | | | |
| StyleGAN, Z space | iter 0 | | | | | |
| StyleGAN, Z space | iter 1000 | | | | | |
| StyleGAN, W space | iter 0 | | | | | |
| StyleGAN, W space | iter 1000 | | | | | |
| StyleGAN, W+ space | iter 0 | | | | | |
| StyleGAN, W+ space | iter 1000 | | | | | |
Here we use a combination of the $L_2$ loss and the perceptual loss as our loss function. The $L_2$ loss constrains raw pixel values, pushing the reconstructed image toward the target in pixel color, while the perceptual loss constrains content, encouraging a similar cat structure and pose. Accordingly, in the images above, as the perceptual loss weight increases from left to right, the color values become less faithful, but the cat's figure and pose grow progressively more similar.
Comparing the results from the vanilla GAN and StyleGAN, the StyleGAN reconstructions are clearly better. The cats from the vanilla GAN are blurred and in places distorted, while those from StyleGAN are sharp and detailed. StyleGAN with a W or W+ latent vector outperforms StyleGAN with a Z latent vector, but it is hard to tell whether W or W+ is better. For the W space, StyleGAN with a 70% perceptual loss weight gives the best result (most accurate pose), taking 229.05 s to reconstruct the cat. For the W+ space, StyleGAN with a 10% perceptual loss weight gives the best result, taking 225.08 s.
Now that we have a technique for inverting cat images, we can do arithmetic with the latent vectors we have just found. One simple example is interpolating between images via a convex combination of their inverses. More precisely, given images $x_1$ and $x_2$, compute $z_1=G^{-1}(x_1)$ and $z_2=G^{-1}(x_2)$. Then, for some $\theta \in(0,1)$, combine the latent codes as $z^{\prime}=\theta z_{1}+(1-\theta) z_{2}$ and generate the image $x'=G(z')$. We choose a discretization of $(0,1)$ to interpolate our image pair.
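The interpolation step can be sketched as follows. This is a hypothetical helper, not the assignment's exact code: `interpolate`, `n_frames`, and the frame-list return type are all illustrative choices, and `z1`, `z2` are assumed to come from the inversion procedure described above.

```python
import torch

def interpolate(generator, z1, z2, n_frames=10):
    """Generate frames along the convex combination z' = theta*z1 + (1-theta)*z2.

    Sketch only: `generator` maps latent codes to images; the frames can be
    assembled into a gif with any image library.
    """
    frames = []
    for theta in torch.linspace(0.0, 1.0, n_frames):
        z = theta * z1 + (1 - theta) * z2  # convex combination of the two latents
        frames.append(generator(z))
    return frames
```

Note that with this convention, `theta = 0` reproduces the second image's latent and `theta = 1` the first's.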
As shown in the gif, the cats gradually transition from the source cat's appearance to the destination cat's. The generated transition frames look quite realistic. But interpolation quality also depends on the quality of the latent vectors (how well they reconstruct the original cats): the first row is an example of good performance, while in the second row the latent vector of the destination image cannot perfectly reconstruct the destination cat, so the transition does not end up matching it.
Next, we would like to constrain our image in some way while keeping it realistic. We initially tackle this with color scribble constraints, but many other constraints are possible as well.
Color Scribble Constraints: Given a user color scribble, we would like the GAN to fill in the details. Say we have a hand-drawn scribble image $s \in \mathbb{R}^{d}$ with a corresponding mask $m \in \{0,1\}^{d}$. Then, for each pixel in the mask, we add the constraint that the corresponding pixel in the generated image equal the sketch, i.e. $m_{i} x_{i}=m_{i} s_{i}$. Since our color scribble constraints are all elementwise, we can reduce them to the objective $$z^{*}=\arg \min _{z}\|M * G(z)-M * S\|^{2},$$ where $*$ is the Hadamard product, $M$ is the mask, and $S$ is the sketch. The results are shown below. (Since scribble-to-image generation involves a lot of randomness, we show only some good results here.)
| Scribble | | | | | |
| --- | --- | --- | --- | --- | --- |
| Generated Cat | | | | | |
From left to right, the scribble becomes denser, and the generated cats shift from a more realistic style to a more painting-like style (more blurred and pale). This matches the intuition that a denser scribble with more color adds more constraints on the latent vector, pulling it farther from the original cat manifold and producing cat images with highly saturated colors, like those in the scribbles.
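The masked objective $\|M * G(z)-M * S\|^{2}$ can be sketched as a small optimization loop. As before, this is an illustrative sketch, not the exact implementation: `invert_scribble` and its parameters are hypothetical names, and `s` and `m` are the scribble and binary mask as tensors shaped like the generator output.

```python
import torch
import torch.nn.functional as F

def invert_scribble(generator, s, m, steps=500, lr=0.05):
    """Find z such that generator(z) matches the scribble s on the masked pixels.

    Sketch only: `m` is 1 where the user drew and 0 elsewhere, so only the
    scribbled pixels constrain the generated image.
    """
    z = torch.randn(1, generator.z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        x = generator(z)
        # Hadamard-masked L2: || M * G(z) - M * S ||^2
        loss = F.mse_loss(m * x, m * s)
        loss.backward()
        opt.step()
    return z.detach()
```

Because the unmasked pixels are unconstrained, the generator is free to fill them in, which is exactly where the run-to-run randomness noted above comes from: different initial `z` values settle on different completions.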