By Flora Cheng
16-726 Learning-Based Image Synthesis (Spring 2025)
This project implements several techniques for manipulating images on the manifold of natural images. The goal is to produce a generated image from an input sketch, and this was done using GAN models as well as diffusion models.
First, this project inverts a pre-trained generator to find a latent variable that closely reconstructs the given real image.
A latent variable is a lower-dimensional representation of an image; given a latent variable, the generator can reconstruct an image. The goal is to find the latent variable whose reconstruction most closely resembles the original image. The loss between the target and generated images is measured with a perceptual loss.
We also considered the latent spaces used in StyleGAN2: w and w+. These spaces can make it easier to reconstruct the input image, at the risk of overfitting.
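As an illustration, below is a minimal sketch of this projection step. It assumes a pre-trained generator `G` that maps a latent code to an image and uses the `lpips` package as the perceptual loss; the function name and hyperparameters are placeholders rather than the project's actual code.

```python
import torch
import lpips  # LPIPS perceptual loss (VGG features); any perceptual distance works similarly

def project(G, target, latent_dim=512, steps=1000, lr=0.1, device="cuda"):
    """Optimize a latent code z so that G(z) reconstructs `target` (1x3xHxW in [-1, 1])."""
    percep = lpips.LPIPS(net="vgg").to(device)
    z = torch.randn(1, latent_dim, device=device, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = G(z)                          # generator: latent -> image
        loss = percep(recon, target).mean()   # perceptual distance to the target image
        loss.backward()
        opt.step()
    return z.detach()
```

For the w and w+ spaces, the same loop would optimize the output of StyleGAN2's mapping network rather than z: a single w vector shared across layers, or one w per layer for w+.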
| Changing Latent Space | z | w | w+ | Target Image |
|---|---|---|---|---|
| Time when run on the stylegan model with other parameters at default | real 0m30.954s user 0m28.613s sys 0m3.194s | real 0m30.689s user 0m28.424s sys 0m3.120s | real 0m31.095s user 0m28.708s sys 0m3.233s | |
| Outputs when run on the stylegan model with other parameters at default | ![]() | ![]() | ![]() | ![]() |
| Changing Model | stylegan | vanillagan | Target Image |
|---|---|---|---|
| Time when run with other parameters at default | real 0m30.954s user 0m28.613s sys 0m3.194s | real 0m11.681s user 0m9.731s sys 0m2.786s | |
| Outputs when run with other parameters at default | ![]() | ![]() | ![]() |
| Changing perceptual weight | .001 | .01 (default) | .1 | 1 |
|---|---|---|---|---|
| Time when run on the stylegan model with other parameters at default | real 0m31.148s user 0m28.928s sys 0m3.034s | real 0m30.954s user 0m28.613s sys 0m3.194s | real 0m31.098s user 0m28.814s sys 0m3.159s | real 0m31.379s user 0m29.148s sys 0m3.049s |
| Outputs when run on the stylegan model with other parameters at default (z latent) | ![]() | ![]() | ![]() | ![]() |
| Outputs when run on the stylegan model with the w+ latent | ![]() | ![]() | ![]() | ![]() |
Running with different configurations, there was no notable difference in runtime except when swapping out the model, where vanillagan was significantly faster than stylegan.

Changing the latent space to w or w+ gives results that align more closely with the target image. This is because w and w+ are derived from z (a random vector) through StyleGAN's mapping network, so they are not restricted to the same fixed distribution.

Changing the model also impacts the results: stylegan gives a clearer result, but the vanilla GAN output is more closely aligned with the target image. This may be because stylegan comes with pre-trained weights that already orient it a certain way, while the vanilla GAN can be fit more closely to the specific target image.

Finally, changing the perceptual weight shows that higher weights give results that are less aligned with the target.
Next, the project takes in a hand-drawn sketch and generates an image that fits the sketch.
This was done with a combination of perceptual and L1 losses, so that the output image also has some alignment with the sketch. Using separate perceptual and L1 weights, the loss is the weighted sum of the L1 loss over the masked (sketched) region and the perceptual loss.
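A minimal sketch of this objective is below; it assumes the images and a binary sketch mask are tensors and that a perceptual-loss module such as LPIPS is passed in, with illustrative weights rather than the project's actual values.

```python
import torch
import torch.nn.functional as F

def sketch_loss(generated, sketch, mask, percep, l1_weight=10.0, percep_weight=0.01):
    """Weighted sum of L1 over the sketched (masked) region and a perceptual term.

    generated, sketch: 1x3xHxW image tensors; mask: 1x1xHxW binary sketch mask;
    percep: a perceptual-loss module (e.g. lpips.LPIPS). Weights are illustrative.
    """
    l1 = F.l1_loss(generated * mask, sketch * mask)      # match colors where the sketch is drawn
    p = percep(generated * mask, sketch * mask).mean()   # perceptual similarity on the same region
    return l1_weight * l1 + percep_weight * p
```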
Below, this was tested on sketches with varying line thickness and on different latent spaces, with the remaining parameters at their defaults.
| Drawing \ Latent | z | w | w+ |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
For the less defined (lines-only) sketches, results in the w and w+ latent spaces are more recognizable; however, they look quite similar to each other and do not resemble the sketch much. When the sketch has more defined, filled-in areas, the w and w+ results are less cohesive and recognizable, while the z latent space results are somewhat better aligned with the sketch.
Finally, we generate images based on an input image and the prompt "Grumpy cat reimagined as a royal painting" using stable diffusion.
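As a rough sketch of this setup, the high-level image-to-image interface in Hugging Face `diffusers` exposes the same two knobs swept in the table below: `guidance_scale` corresponds to the guidance strength, and `strength` plays the role of the starting timestep (roughly timestep/1000), since it controls how much noise is added to the input before denoising. The checkpoint name and file paths here are assumptions, and the project may implement the noising/denoising loop manually rather than through this pipeline.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16  # assumed checkpoint
).to("cuda")

init = Image.open("grumpy_cat.png").convert("RGB").resize((512, 512))  # hypothetical input path
out = pipe(
    prompt="Grumpy cat reimagined as a royal painting",
    image=init,
    strength=0.7,         # analogous to starting at timestep ~700 of 1000
    guidance_scale=15.0,  # the guidance strength swept below
).images[0]
out.save("royal_cat.png")
```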
| Parameters \ Image | ![]() | ![]() |
|---|---|---|
| 15 Guidance Strength, 300 Timesteps | ![]() | ![]() |
| 15 Guidance Strength, 500 Timesteps | ![]() | ![]() |
| 15 Guidance Strength, 700 Timesteps | ![]() | ![]() |
| 15 Guidance Strength, 999 Timesteps | ![]() | ![]() |
| 5 Guidance Strength, 700 Timesteps | ![]() | ![]() |
| 25 Guidance Strength, 700 Timesteps | ![]() | ![]() |
As we increase the amount of noise added (a larger starting timestep), the output aligns less with the input image, to the point that the 999-timestep result bears no resemblance to the starting image. Increasing the guidance strength should produce images that align more with the prompt; this is less immediately obvious, but the higher-guidance images resemble grumpy cats a bit more than the lower-guidance ones.
Because we can find latent vectors corresponding to specific images produced by the generator, given two target images we can interpolate between their two latent codes in the GAN and generate an image sequence that smoothly transitions from one image to the other.
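A minimal sketch of this interpolation is below, assuming `z_start` and `z_end` were recovered by projecting the two target images with the inversion procedure above; the returned frames could then be written out as a GIF. Linear interpolation is shown for simplicity, and the same idea applies in the w and w+ spaces.

```python
import torch

@torch.no_grad()
def interpolate(G, z_start, z_end, num_frames=60):
    """Render a sequence of frames by linearly blending two projected latent codes."""
    frames = []
    for alpha in torch.linspace(0.0, 1.0, num_frames):
        z = (1 - alpha) * z_start + alpha * z_end  # convex combination of the two latents
        frames.append(G(z).clamp(-1, 1).cpu())     # generate and store each intermediate image
    return frames
```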
| Latent | Original Starting Image | Projected Starting Image | Interpolated GIF | Projected Ending Image | Original Ending Image |
|---|---|---|---|---|---|
| z | ![]() | ![]() | ![]() | ![]() | ![]() |
| z | ![]() | ![]() | ![]() | ![]() | ![]() |
| w | ![]() | ![]() | ![]() | ![]() | ![]() |
| w | ![]() | ![]() | ![]() | ![]() | ![]() |
| w+ | ![]() | ![]() | ![]() | ![]() | ![]() |
| w+ | ![]() | ![]() | ![]() | ![]() | ![]() |