In the past few years, the quality of images synthesized by GANs has increased rapidly. Compared to the seminal DCGAN framework of 2015, current state-of-the-art GANs can synthesize images at much higher resolution and with significantly greater realism. Among them, StyleGAN makes use of an intermediate latent space W that holds the promise of enabling controlled image modifications. Image modification is more exciting when it becomes possible to modify a given image rather than a randomly generated one. This leads to the natural question of whether it is possible to embed a given photograph into the GAN latent space.
In this project, we will implement several techniques to manipulate images on the manifold of natural images. First, we will invert a pre-trained generator to find a latent variable that closely reconstructs a given real image. In the second part of the project, we will take a hand-drawn sketch and generate an image that fits the sketch.
In this part, we optimize a latent code, initialized with noise, so that its loss against a given target image is minimized. The process can be viewed as projecting the target image onto a low-dimensional manifold.
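The projection step can be sketched as follows. This is a minimal illustration, not the exact project code: it assumes a PyTorch generator `G` exposing a `z_dim` attribute, and uses a plain pixel-wise L2 loss with Adam (the project also experiments with perceptual loss and other optimizers).

```python
import torch

def project(G, target, num_steps=1000, lr=0.1):
    """Find a latent code z such that G(z) closely reconstructs `target`.

    `G` and `z_dim` are illustrative assumptions; the real project may use
    a different latent space (z, w, or w+) and a combined loss.
    """
    z = torch.randn(1, G.z_dim, requires_grad=True)  # random initialization
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(num_steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(G(z), target)  # pixel-wise L2
        loss.backward()
        opt.step()
    return z.detach()
```

In the full project, the same loop is reused with different latent spaces and loss weightings, which is what the 12 setups below vary.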
Here, we are going to explore the influence of three major components of the algorithm:
Some details of my implementation are:
We ran the optimization solver with the 12 different setups mentioned above, using the first image from the grumpy cat collection. The resulting reconstructions are shown below:
First, we can see that the vanilla model failed to reconstruct the background color compared with the StyleGAN models, though its running time is much shorter owing to the simpler generator architecture. Comparing the results column by column, we can tell that the introduction of perceptual loss improves the quality of detail reconstruction (such as the shape of the eyes). But it also makes the nonconvex optimization problem harder to solve: when the perceptual loss weight is large, the resulting images are not as good within the same number of steps, and more iterations are needed to obtain a better result. Out of all the reconstructions, the StyleGAN model using the w+ latent space with perceptual loss provides the best-quality reconstruction. For one image, it takes about 17 seconds to run.
In this part, we can do arithmetic with the latent vectors we have just found. Interpolating via a convex combination of the inverses of two images provides the intermediate images between them. More precisely, given images $x_1$ and $x_2$, compute their latent codes $z_1 = G^{-1}(x_1)$ and $z_2 = G^{-1}(x_2)$. Then we can combine the latent codes for some $\theta \in (0, 1)$ by $z' = \theta z_1 + (1 - \theta) z_2$ and generate the intermediate image via $x' = G(z')$.
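The interpolation itself is a short loop. A minimal sketch, assuming `z1` and `z2` are latent codes already obtained by inversion and `G` is any callable generator (names are illustrative):

```python
import torch

def interpolate(G, z1, z2, num_points=20):
    """Generate images along the convex combination z' = theta*z1 + (1-theta)*z2."""
    frames = []
    for theta in torch.linspace(0.0, 1.0, num_points):
        z = theta * z1 + (1.0 - theta) * z2  # convex combination of the codes
        frames.append(G(z))                  # x' = G(z')
    return frames
```

The returned frames can then be written out as a GIF to visualize the transition.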
We ran the optimization with 20 interpolated points in the latent space for the best model identified above, using 10000 steps of iterative optimization. The results are demonstrated as a GIF between the two given images of each pair:
We can conclude from the results above that the interpolations are in most cases realistic and smooth. In pair 2, the position of the nose moves up smoothly. However, unrealistic results are generated for pair 3, where the left eye is gradually generated through the interpolation. The major issue for this pair is that the initial optimization result for image 1 is poor (the left eye is blurred).
In this part, we would like to constrain our image in some way while keeping it realistic. As an initial approach, we use color scribble constraints.
Given a user color scribble, we would like the GAN to fill in the details. Say we have a hand-drawn scribble image $S$ with a corresponding binary mask $M$. Then for each pixel selected by the mask, we can add a constraint that the corresponding pixel in the generated image must be equal to the sketch, which might look like $G(z)_i = S_i$ for every pixel $i$ with $M_i = 1$.
Since our color scribble constraints are all elementwise, we can reduce the above constraints to the objective:

$$z^* = \arg\min_z \| M \odot G(z) - M \odot S \|^2$$

where $\odot$ is the Hadamard product, $M$ is the mask, and $S$ is the sketch.
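The masked objective translates directly into code. A minimal sketch, assuming image tensors of matching shape (the function name and signature are illustrative):

```python
import torch

def scribble_loss(G, z, sketch, mask):
    """Masked reconstruction loss || M * G(z) - M * S ||^2.

    Pixels are constrained only where the mask is 1; everywhere else the
    generator is free to fill in details.
    """
    diff = mask * G(z) - mask * sketch  # Hadamard products with the mask
    return (diff ** 2).sum()
```

This loss is then minimized over the latent code with the same optimization loop used for inversion.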
We ran the optimization in the latent space for the best model identified above. 10000 steps are used during the iterative optimization. The results are demonstrated below:
From the table demonstrated above, we can see that for denser scribbles like 1, 2, and 3, the quality of reconstructions is polarized. When there are similar images in the training set (like 1 and 2), the reconstruction looks realistic. However, when the dense scribble is distinct from the training set, the loss forces the latent variable to generate an image very similar to the given scribble. This means that the optimum found is too far from the distribution of the real images in the training set. When the scribble is sparse, the optimization problem is easier since there are fewer constraints. But that also means less guidance is provided to the network to fill in the details. So we can see that in case 4, the pose of the cat in the reconstruction is different from that in the scribble, although the color constraints are mostly satisfied.
To avoid the optimized latent vector drifting too far from the original distribution, we developed a better strategy, explained in Bells & Whistles.
In this part, we revise the algorithm from the previous part to make the resulting images more realistic.
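One common way to keep the optimized latent close to the natural-image manifold is to add a regularizer that penalizes deviation from a reference latent (e.g. the mean latent). This sketch is an assumption about what such a revision could look like, not the project's exact method; `z_mean` and `alpha` are hypothetical names:

```python
import torch

def regularized_objective(G, z, sketch, mask, z_mean, alpha=1e-3):
    """Masked reconstruction loss plus a latent regularizer (illustrative).

    `z_mean` and the penalty form are assumptions: the term alpha*||z - z_mean||^2
    discourages the optimizer from wandering far from the latent distribution.
    """
    recon = ((mask * G(z) - mask * sketch) ** 2).sum()
    reg = alpha * ((z - z_mean) ** 2).sum()  # keeps z near the reference latent
    return recon + reg
```

Larger values of `alpha` trade scribble fidelity for realism, which matches the failure mode observed with dense out-of-distribution scribbles.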
Here are the results using this revised algorithm: