Master of Science in Machine Learning, 2023
In this project, various techniques that involve the modification of images in the natural image manifold are explored. These techniques can be used to synthesize new images from existing ones.
Starting from noise or style latents, a delta to this noise can be trained to minimize the reconstruction loss between the generator's output and the target image.
Once the produced image is sufficiently similar to the target image, the noise with the delta provides a good latent representation of the generated image.
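The optimization described above can be sketched as follows. This is a minimal NumPy stand-in, not the actual training code: the real pipeline backpropagates through a GAN generator, while here the gradient of a plain L2 reconstruction loss is estimated by finite differences on a toy generator, and the learning rate and step count are illustrative placeholders.

```python
import numpy as np

def invert(generator, z0, target, lr=0.05, steps=200):
    """Optimize a delta on the initial latent z0 so that
    generator(z0 + delta) approaches the target image.
    Gradients are estimated numerically for simplicity; a real
    implementation would backpropagate through the generator."""
    delta = np.zeros_like(z0)
    eps = 1e-4
    for _ in range(steps):
        # L2 reconstruction loss at the current delta
        base = np.sum((generator(z0 + delta) - target) ** 2)
        # forward-difference estimate of the gradient w.r.t. delta
        grad = np.zeros_like(delta)
        for i in range(delta.size):
            d = delta.copy()
            d.flat[i] += eps
            grad.flat[i] = (np.sum((generator(z0 + d) - target) ** 2) - base) / eps
        delta -= lr * grad
    # noise plus trained delta: the latent representation of the image
    return z0 + delta
```

With a linear toy generator, the recovered latent reproduces the target almost exactly, which is the stopping criterion described above.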
One key hyperparameter affecting the quality of the generated images is the relative weighting of the perceptual loss and the L1 reconstruction loss.
The perceptual loss is the MSE between latent activations at predetermined layers, while the L1 reconstruction loss is the L1 norm of the pixel-wise differences between the generated image and its target.
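The two terms can be combined as a weighted sum, sketched below. Feature extraction (e.g. from a pretrained network) is assumed to happen elsewhere; the function only mixes the two losses, and the default weight of 0.1 is one of the ratios swept in the experiments.

```python
import numpy as np

def combined_loss(gen_feats, tgt_feats, gen_img, tgt_img, perc_weight=0.1):
    """Weighted sum of a perceptual term (MSE between feature activations
    at chosen layers) and an L1 pixel-wise reconstruction term."""
    # MSE between activations, averaged across the chosen layers
    perceptual = np.mean([np.mean((g - t) ** 2)
                          for g, t in zip(gen_feats, tgt_feats)])
    # L1 loss over pixel-wise differences
    l1 = np.mean(np.abs(gen_img - tgt_img))
    return perc_weight * perceptual + (1.0 - perc_weight) * l1
```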
Various orders of magnitude of ratios between the two losses were tested.
Hover over each image to view the proportion of losses allocated to perceptual loss.
Based on the above results, the optimal weight of the perceptual loss is 0.1 and that of the L1 reconstruction loss is 0.9, as this setting produces the best images over the set of images tested.
Another candidate loss term is an L2 regularization on delta, since we do not want deltas that push the latent noise too far from its original sampling distribution.
For each image, reconstruction is performed both with and without the L2 regularization term.
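This regularizer is simply a scaled squared norm of the delta; the weight below is an illustrative placeholder, not a value from the experiments.

```python
import numpy as np

def delta_regularization(delta, weight=0.01):
    """L2 penalty discouraging the optimized delta from drifting the
    latent far from its original sampling distribution."""
    return weight * np.sum(delta ** 2)
```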
Hover over each image to see if regularization loss is added.
From these examples, the regularization loss does not contribute significantly to the quality of the images generated, and thus will be left out in subsequent runs.
Both a vanilla GAN and StyleGAN are tested to see which generator is better suited for inversion.
Hover over each image to see the generator used for inversion.
StyleGAN seems to be the generator of choice for generating images that more closely resemble the input target.
Beyond beginning from randomly sampled noise, style latents can also be used as a base to generate the images.
The latent spaces of z, w and w+ are all used to regenerate each image using the StyleGAN generator.
Hover over each image to view which noise configuration is used for generation.
Among the latent spaces of z, w and w+, w+ seems to generate the most accurate images in terms of the color hue, cat face orientation, and similarity to the input image.
In this part, the generated images are made to look similar to a sketched target by modifying the loss function.
Given an image sketch, a mask corresponding to the active regions of this sketch is applied on the generated image and the target image, and a pixel-wise L1 loss is applied between them.
By doing so, the images generated still lie on the natural manifold but are made to follow the structure of the target sketch.
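The masked loss described above can be sketched as follows: the L1 penalty is averaged only over the pixels the sketch actually touches, so the untouched regions remain free to stay on the natural image manifold.

```python
import numpy as np

def sketch_loss(generated, sketch, mask):
    """L1 loss restricted to the active (drawn) regions of the sketch.
    `mask` is a binary array marking where sketch strokes exist."""
    active = mask.astype(bool)
    if not active.any():
        return 0.0
    # average absolute pixel difference over the masked region only
    return float(np.mean(np.abs(generated[active] - sketch[active])))
```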
The images below show some images generated from the various sketches.
Click on the buttons to view the various images.
For the most part, the generated images bear some resemblance to the input sketch. However, they do not look realistic, perhaps because of the color scheme or the crudeness of the drawn sketches.
There is also one case where the image generated does not reflect the input sketch at all. This could be due to the fact that the sketch does not lie within the natural image manifold of the generator.
In this case, the generator chooses to prioritize the realism of the image, rather than trying to account for the reconstruction loss. Increasing the weight of the reconstruction loss could lead to an image that better resembles the sketch.
On average, using the w+ style latents also improves the quality of the generated image, since a separate latent vector is optimized for each layer of the generator.
These images were generated from sketches that were varied in sparsity, shape and color.
The generated output responds well to different shapes, producing images of cats that closely resemble that of the sketch.
However, given a different color palette of input sketches, the produced images do not look very natural. This could be problematic if the realism of the images is sensitive to the colors of the input sketch.
In this part, an input image was noised and denoised for a fixed number of steps, conditioned on a prompt which serves as a guide for the denoising process.
For each of the two images, the guidance strength and the number of steps were varied, to assess the impact of these hyperparameters on the resultant output.
Each image's associated prompt is provided, and its guidance strength and number of steps can be found by hovering over each image.
Prompt: Grumpy cat reimagined as a royal painting
Prompt: An accurate depiction of the solar system.
For both images, fewer than 500 noising and denoising steps led to a near-complete reconstruction of the original image, which was expected since the image was not noised enough to change substantially.
As the number of steps increased, similarity to the original image decreased and the outputs reflected the contents of the prompt more.
The guidance strength also played an important role, with a larger guidance strength accentuating the features of the image that better reflect the content of the prompt.
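The effect of the step count can be seen in the forward noising process alone. The sketch below applies a DDPM-style cumulative noise schedule (standard default schedule values, used here only for illustration; the denoiser and prompt conditioning are omitted): a small step count leaves the image nearly intact, which is why low-step runs reconstructed the original almost exactly.

```python
import numpy as np

def noise_image(x0, t, T=1000, beta_min=1e-4, beta_max=0.02, rng=None):
    """Forward (noising) step of a DDPM-style schedule: the input is
    mixed with Gaussian noise according to the cumulative schedule at
    step t. Small t keeps the image close to the original."""
    rng = rng or np.random.default_rng(0)
    betas = np.linspace(beta_min, beta_max, T)
    alpha_bar = np.cumprod(1.0 - betas)[t - 1]  # cumulative signal fraction
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
```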
By generating images from convex combinations of latent space vectors, it is possible to generate a series of images that represent a continuous transformation from one image to the other.
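Producing the intermediate latents is a one-liner per step; each convex combination is then fed to the generator (omitted here) to render one frame of the morph.

```python
import numpy as np

def interpolate_latents(z_a, z_b, n=8):
    """Convex combinations of two latent vectors; feeding each mix to
    the generator yields a continuous morph from one image to the other."""
    ts = np.linspace(0.0, 1.0, n)
    return [(1.0 - t) * z_a + t * z_b for t in ts]
```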