Master of Science in Machine Learning, 2023
In this project, various techniques that involve the modification of images in the natural image manifold are explored. These techniques can be used to synthesize new images from existing ones.
Starting from noise or style latents, a delta to this noise can be trained to minimize the reconstruction loss between the generator's output and the target image.
Once the produced image is sufficiently similar to the target image, the noise with the delta provides a good latent representation of the generated image.
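The optimization described above can be sketched as follows. This is a minimal NumPy stand-in, not the actual training code: the real pipeline backpropagates through a GAN generator, while here the gradient of a plain L2 reconstruction loss is estimated by finite differences on a toy generator, and the learning rate and step count are illustrative placeholders.

```python
import numpy as np

def invert(generator, z0, target, lr=0.05, steps=200):
    """Optimize a delta on the initial latent z0 so that
    generator(z0 + delta) approaches the target image.
    Gradients are estimated numerically for simplicity; a real
    implementation would backpropagate through the generator."""
    delta = np.zeros_like(z0)
    eps = 1e-4
    for _ in range(steps):
        # L2 reconstruction loss at the current delta
        base = np.sum((generator(z0 + delta) - target) ** 2)
        # forward-difference estimate of the gradient w.r.t. delta
        grad = np.zeros_like(delta)
        for i in range(delta.size):
            d = delta.copy()
            d.flat[i] += eps
            grad.flat[i] = (np.sum((generator(z0 + d) - target) ** 2) - base) / eps
        delta -= lr * grad
    # noise plus trained delta: the latent representation of the image
    return z0 + delta
```

With a linear toy generator, the recovered latent reproduces the target almost exactly, which is the stopping criterion described above.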
One key hyperparameter affecting the quality of the generated images is the relative weighting of the perceptual loss and the L1 reconstruction loss.
The perceptual loss is the MSE between latent activations at predetermined layers, while the L1 reconstruction loss is the L1 norm of the pixel-wise differences between the generated image and its target.
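The two terms can be combined as a weighted sum, sketched below. Feature extraction (e.g. from a pretrained network) is assumed to happen elsewhere; the function only mixes the two losses, and the default weight of 0.1 is one of the ratios swept in the experiments.

```python
import numpy as np

def combined_loss(gen_feats, tgt_feats, gen_img, tgt_img, perc_weight=0.1):
    """Weighted sum of a perceptual term (MSE between feature activations
    at chosen layers) and an L1 pixel-wise reconstruction term."""
    # MSE between activations, averaged across the chosen layers
    perceptual = np.mean([np.mean((g - t) ** 2)
                          for g, t in zip(gen_feats, tgt_feats)])
    # L1 loss over pixel-wise differences
    l1 = np.mean(np.abs(gen_img - tgt_img))
    return perc_weight * perceptual + (1.0 - perc_weight) * l1
```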
Various orders of magnitude of ratios between the two losses were tested.
Hover over each image to view the proportion of losses allocated to perceptual loss.
Based on the above results, the optimal weight of the perceptual loss is 0.1 and that of the L1 reconstruction loss is 0.9, as this setting produces the best images over the set of images tested.
Another candidate loss term is an L2 regularization on delta, since we do not want deltas that push the latent noise too far from its original sampling distribution.
For each image, reconstruction is performed both with and without the L2 regularization term.
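This regularizer is simply a scaled squared norm of the delta; the weight below is an illustrative placeholder, not a value from the experiments.

```python
import numpy as np

def delta_regularization(delta, weight=0.01):
    """L2 penalty discouraging the optimized delta from drifting the
    latent far from its original sampling distribution."""
    return weight * np.sum(delta ** 2)
```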
Hover over each image to see if regularization loss is added.
From these examples, the regularization loss does not contribute significantly to the quality of the images generated, and thus will be left out in subsequent runs.
Both a vanilla GAN and StyleGAN are tested to see which generator is better suited for inversion.
Hover over each image to see the generator used for inversion.
StyleGAN seems to be the generator of choice for generating images that more closely resemble the input target.
Beyond beginning from randomly sampled noise, style latents can also be used as a base to generate the images.
The latent spaces of z, w and w+ are all used to regenerate each image using the StyleGAN generator.
Hover over each image to view which noise configuration is used for generation.
Among the latent spaces of z, w and w+, w+ seems to generate the most accurate images in terms of the color hue, cat face orientation, and similarity to the input image.
In this part, the generated images are made to look similar to a sketched target by modifying the loss function.
Given an image sketch, a mask corresponding to the active regions of this sketch is applied on the generated image and the target image, and a pixel-wise L1 loss is applied between them.
By doing so, the images generated still lie on the natural manifold but are made to follow the structure of the target sketch.
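The masked loss described above can be sketched as follows: the L1 penalty is averaged only over the pixels the sketch actually touches, so the untouched regions remain free to stay on the natural image manifold.

```python
import numpy as np

def sketch_loss(generated, sketch, mask):
    """L1 loss restricted to the active (drawn) regions of the sketch.
    `mask` is a binary array marking where sketch strokes exist."""
    active = mask.astype(bool)
    if not active.any():
        return 0.0
    # average absolute pixel difference over the masked region only
    return float(np.mean(np.abs(generated[active] - sketch[active])))
```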
The images below show some images generated from the various sketches.
Click on the buttons to view the various images.
For the most part, the generated images bear some resemblance to the input sketch. However, they do not look realistic, perhaps because of the color scheme or the crudeness of the drawn sketches.
There is also one case where the image generated does not reflect the input sketch at all. This could be due to the fact that the sketch does not lie within the natural image manifold of the generator.
In this case, the generator chooses to prioritize the realism of the image, rather than trying to account for the reconstruction loss. Increasing the weight of the reconstruction loss could lead to an image that better resembles the sketch.
On average, using the w+ style latents also improves the quality of the generated image, since a separate latent vector is optimized for each layer of the generator.
These images were generated from sketches that were varied in sparsity, shape and color.
The generated output responds well to different shapes, producing images of cats that closely resemble that of the sketch.
However, given a different color palette of input sketches, the produced images do not look very natural. This could be problematic if the realism of the images is sensitive to the colors of the input sketch.
In this part, an input image was noised and denoised for a fixed number of steps, conditioned on a prompt which serves as a guide for the denoising process.
For each of the two images, the guidance strength and the number of steps were varied, to assess the impact of these hyperparameters on the resultant output.
Each image's associated prompt is provided, and its guidance strength and number of steps can be found by hovering over each image.
Prompt: Grumpy cat reimagined as a royal painting
Prompt: An accurate depiction of the solar system.
For both images, fewer than 500 noising and denoising steps led to a near-complete reconstruction of the original image, which was expected since the image was not noised enough to change substantially.
As the number of steps increased, similarity to the original image decreased and the outputs reflected the contents of the prompt more.
The guidance strength also played an important role, with a larger guidance strength accentuating the features of the image that better reflect the content of the prompt.
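The effect of the step count can be seen in the forward noising process alone. The sketch below applies a DDPM-style cumulative noise schedule (standard default schedule values, used here only for illustration; the denoiser and prompt conditioning are omitted): a small step count leaves the image nearly intact, which is why low-step runs reconstructed the original almost exactly.

```python
import numpy as np

def noise_image(x0, t, T=1000, beta_min=1e-4, beta_max=0.02, rng=None):
    """Forward (noising) step of a DDPM-style schedule: the input is
    mixed with Gaussian noise according to the cumulative schedule at
    step t. Small t keeps the image close to the original."""
    rng = rng or np.random.default_rng(0)
    betas = np.linspace(beta_min, beta_max, T)
    alpha_bar = np.cumprod(1.0 - betas)[t - 1]  # cumulative signal fraction
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
```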
By generating images from convex combinations of latent space vectors, it is possible to generate a series of images that represent a continuous transformation from one image to the other.
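Producing the intermediate latents is a one-liner per step; each convex combination is then fed to the generator (omitted here) to render one frame of the morph.

```python
import numpy as np

def interpolate_latents(z_a, z_b, n=8):
    """Convex combinations of two latent vectors; feeding each mix to
    the generator yields a continuous morph from one image to the other."""
    ts = np.linspace(0.0, 1.0, n)
    return [(1.0 - t) * z_a + t * z_b for t in ts]
```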