Blending StyleGAN2 models to turn faces into cats (and more)
Project Website for 16726 - Learning-Based Image Synthesis
CMU Spring 2021
Tarang Shah (tarangs)
Rohan Rao (rgrao)
Goal of the project
- Our project looks at how we can make artistic edits to images through model blending.
- The models described here are StyleGAN2 models: one pretrained model (Faces), plus others trained using transfer learning on their respective datasets.
- We use these models to transfer characteristics from cats/dogs/cartoons/wild animals onto the faces of people we know 😉
Projector upgrades to StyleGAN2
The original projector code optimizes a single latent vector produced by StyleGAN's internal mapping network. However, inspired by HW5, we modify this to optimize a collection of latent vectors instead, one per generator layer (this is technically a latent "tensor", but we refer to it as a latent vector for brevity).
We also tried adding a Mean Squared Error (MSE) loss, but we noted that this MSE loss was actually making the resulting images smoother and less photorealistic, which reduced the perceptual quality. We also tried varying the noise regularization parameter, but this too did not result in any significant changes in the generated outputs.
In the future, we would like to add a feed-forward neural network to provide a quick one-shot initialization for the optimization.
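To make this change concrete, here is a minimal sketch, assuming NVIDIA's stylegan2-ada-pytorch generator API (`G.z_dim`, `G.num_ws`, `G.mapping`), of how such a latent "tensor" can be initialized for optimization. The variable names and the initialization from a random z are illustrative, not the exact code from our repository.

```python
# A rough sketch (not the exact repo code) of building the latent "tensor"
# we optimize, instead of a single latent vector.
import torch

device = torch.device('cuda')
z = torch.randn(1, G.z_dim, device=device)        # a random z latent
w = G.mapping(z, None)                            # shape (1, G.num_ws, w_dim); identical rows
w_plus = w.detach().clone().requires_grad_(True)  # optimize every row independently
```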
Training StyleGAN2
- Start with a pretrained FFHQ-based face generator model. Here we used the 256x256 version (from NVIDIA) for faster experimentation. We also used the version with the latest ADA improvements, which allows us to train a StyleGAN model with significantly fewer images.
- Fine-tune the model on 4 datasets: AFHQ Cats, AFHQ Dogs, AFHQ Wild Animals, and Google Cartoons
Inverting the Generator
Background
The core idea here is to invert an image generator, particularly the StyleGAN2 generator. Before we get into the details of the inversion and how we do it, let's first understand what the generator does. An image generator usually takes a vector as input and generates an image. Essentially, it is a black box which takes a vector and returns an image.
Vector → [Generator] → Image
We can see the visualization of the generator model from our "Vanilla GAN" of Homework 3 below
Although we show the Generator from a simple GAN, it is possible to use any generator. For the purposes of this assignment, we use the very famous and popular StyleGAN2 generator.
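As a toy illustration of this black-box view, here is a sketch assuming a StyleGAN2 generator `G` loaded from the stylegan2-ada-pytorch codebase (shapes correspond to the 256x256 model):

```python
import torch

z = torch.randn(1, G.z_dim)   # input: a latent vector (512-dimensional by default)
img = G(z, None)              # output: an image tensor of shape (1, 3, 256, 256)
```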
The task
Now that we have seen what a generator does, let's talk about our task. Our first task is to generate a vector from a given input image. This is literally the opposite of what the generator does 🙃.
The vector we want from a given image is also known as a latent vector, since it belongs to the "latent space" of the generator.
Given an input image, the goal is to find a latent vector that produces the input image when we pass it through the generator.
Input Image → [ ?? ] → Latent Vector → [Generator] → Generated Image
Our goal is to figure out the "??" in the above process, such that the Generated Image is as similar to the input image as possible. Since the process is the reverse of what the Generator does, we call this "inverting" the generator.
We use optimization techniques for achieving this inversion. We don't actually use a model to replace the "??" in the above image, but we use some math and optimization techniques to achieve the results we want.
Steps followed
- We start with a random Latent Vector and then pass it through the Generator.
- The Generator is in `eval` mode, so it is only used for a forward pass
- Since we want the image produced by the Generator to be as close to the real image as possible, we need to build a suitable loss function
- We use a weighted combination of a simple Mean-Squared-Error loss and a Perceptual Loss, as mentioned in this paper (the weight on the perceptual term is also called the perceptual weight, or `perc_wgt`)
- The Perceptual Loss here is the "Content Loss" at `conv_4` of a VGG network, as described here
- The loss is calculated between the generated image and the input real image
- We use an Adam optimizer on this loss to optimize the input latent vector (changing from the LBFGS optimizer in HW5)
- Finally, after about 500 iterations, we take the resulting vector as the optimized latent vector
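Putting these steps together, here is a condensed sketch of the projection loop (illustrative, not the exact code from our repository). It assumes the stylegan2-ada-pytorch utilities for loading and running the generator, a `target` image tensor in the generator's output range, and a VGG-based `perceptual_loss` helper as a hypothetical stand-in for the content loss above; the checkpoint path, learning rate, and `perc_wgt` value are placeholders.

```python
import torch
import torch.nn.functional as F
import dnnlib, legacy   # utilities from the stylegan2-ada-pytorch codebase

device = torch.device('cuda')
with dnnlib.util.open_url('ffhq-256.pkl') as f:           # placeholder checkpoint path
    G = legacy.load_network_pkl(f)['G_ema'].to(device)
G.eval()                                                  # frozen; forward passes only

# Initialize the latent tensor from a random z and optimize it with Adam.
z = torch.randn(1, G.z_dim, device=device)
w_plus = G.mapping(z, None).detach().clone().requires_grad_(True)
optimizer = torch.optim.Adam([w_plus], lr=0.01)           # Adam instead of LBFGS (as in HW5)

perc_wgt = 0.01                                           # placeholder perceptual weight
for step in range(500):
    optimizer.zero_grad()
    img = G.synthesis(w_plus)                             # image from the current latent
    # MSE + perceptual loss against the real target image;
    # `perceptual_loss` is a hypothetical VGG conv_4 content-loss helper.
    loss = F.mse_loss(img, target) + perc_wgt * perceptual_loss(img, target)
    loss.backward()
    optimizer.step()
# `w_plus` now approximately inverts `target` through G
```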
Model Blending using StyleGAN2
Background and Task Description
Here, we would like to blend two trained StyleGAN2 models together. The first model, called the base model, is trained on a particular dataset, like FFHQ. We already have pre-trained FFHQ models available, thanks to NVIDIA. These models are then fine-tuned on specific datasets, such as AFHQ-Cats, AFHQ-Dogs, AFHQ-Wild Animals, and Google Cartoons.
We then use two trained models and swap out the corresponding layers using either binary or fractional blending techniques. This is described in the figure below.
As shown above, we can create a model that uses some weights from the base model and some from the second (blended-in) model, and then use it to generate new and interesting results. Here we have two options: either switch between the two sets of weights abruptly, or use a fractional linear combination to smoothly transition between them.
When the blending layer is B_k, we apply the following operation to every layer of each block above B_k:

W_blend = (1 − α) · W_base + α · W_second

For the simple (binary) case, α = 1, i.e. we swap in the second model's weights outright.

Alternatively, we also use a progressive (fractional) value of α that increases with the block index m, controlled by the blend width q. The main configurable parameter for fractional blending is the value of q; we found the best results at q = 0.7.

Where:
- W_blend = weights of the resultant blended model
- W_base = weights of the base model layer
- W_second = weights of the second model layer
- α = weight factor
- k = block after which we start blending
- m = index of the block, starting at 0 for the kth block (for example, if k is 8, B16 has m = 1 and B32 has m = 2)
- q = blend width
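As a concrete illustration of the formula above, here is a minimal sketch of blending two generator state dicts in PyTorch (illustrative, not the exact code from our repository). It assumes both generators share the stylegan2-ada-pytorch architecture, that synthesis-block parameters are named like `synthesis.b16.conv0.weight`, and that the fractional alpha schedule shown is a placeholder for the progressive rule described above.

```python
import copy
import math
import re

def blend_generators(G_base, G_second, k=8, q=0.7, fractional=True):
    """Blend the weights of two generators for synthesis blocks from B_k upwards."""
    G_blend = copy.deepcopy(G_base)
    base_sd, second_sd = G_base.state_dict(), G_second.state_dict()
    blend_sd = G_blend.state_dict()
    for name in blend_sd:
        match = re.search(r"\.b(\d+)\.", name)            # parse the block resolution from the name
        if match is None:
            continue                                      # e.g. mapping network: keep base weights
        res = int(match.group(1))
        if res < k:
            continue                                      # blocks below B_k keep the base weights
        m = int(math.log2(res // k))                      # block index: B_k -> 0, B_2k -> 1, ...
        alpha = min(1.0, (m + 1) * q) if fractional else 1.0   # placeholder progressive schedule
        blend_sd[name] = (1 - alpha) * base_sd[name] + alpha * second_sd[name]
    G_blend.load_state_dict(blend_sd)
    return G_blend
```

The returned blended generator can then be used exactly like either parent model, for example together with the projection code shown earlier.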
Note: the second model MUST BE transfer-learned from the base model, for best results. Here we show the difference with and without transfer learning:
From this experiment, it seems that when models are transfer-learned, many features in the early layers of the models remain related. Hence, when we blend or even swap layers across these models, we get interesting combinations.
Blending Model Weights vs Latent Space Interpolation
Aspect | Latent Space Interpolation | Blending Weights |
---|---|---|
Type of operation | Single-model operation | Operates across multiple models: humans, cats, dogs, wild animals, cartoons! |
Number of operations | Allows multiple operations within the same model | One-time operation: blend the weights and save the blended model |
Ease of use with new inputs | Need to find a latent vector for each input (requires optimization or a forward pass through a CNN) | If we already have the base-model latent vectors, they do not need to be recalculated per model |
How it works | Tweak the latent vector to obtain the required styles | Apply the same style to a variety of inputs by reusing the blended weights |
Experiments and Interesting Results
Simple Blending Experiments (with random seeds)
In the first row, the Dog generator is the base model and the face generator is the blended-in model. As we blend at higher and higher layers, we see less of the blended-in model (faces).
We also see some interesting things happening here:
- In the first row, the B4 column shows that the FFHQ network uses the corresponding latent vector to generate a person with a very similar face pose, background, and angle.
- For the rest of the rows, B16 and B32 generate the most exciting blends that form the "uncanny valley" between the human faces and cat faces.
More Interesting Blending Experiments
The middle section includes the best layers to blend at. Between B16 and B64, we clearly see that it's a mix of the two models, but the cuteness is still preserved 😍
High-level Procedure
Results on custom images
Github Repository
The code for this project is available here - https://github.com/t27/stylegan2-blending
We started from the base StyleGAN2-ADA PyTorch repository and made various changes, including improvements to the existing code as well as additions of our own.
Links to our pretrained models and instructions to set up and run our code are also provided in the README of the repository.
We also include the source code for the webapp demo described below.
Demo Website
We have a demo website for this project, built with the Streamlit library.
The demo website is available here: https://t27.pagekite.me/
Since the website runs on our local machine, we aim to keep it live until May 25th.
Here is a video walkthrough of the website - https://youtube.com/watch?v=Urr-bbI10DQ
Note for the instructors and TAs: if there are any issues or problems with the website, please contact us and we will do our best to fix them and ensure the link is up and running.
Applications
- Can be used for sim2real transfer, allowing us to preserve specific features from different spaces (low/high level, depending on the blend layers)
- Can be used for artistic applications, including generation of caricatures, realistic (uncanny) avatars for games, and Animoji (Apple)
Next Steps
- More experiments on fractional blending, to explore how much control we can have over the blending results and whether it is possible to edit finer features in the output images
- Exploring newer creative results - Interleaved or randomized blending instead of sequential blending after a Kth block
- Using a feed-forward network for the latent space projection, so that image-to-image style transfer can be done in a single pass
- Combining Latent Space editing and Weight blending for building more creative tools for inspiring artists
References
- Karras et al., Analyzing and Improving the Image Quality of StyleGAN
- Pinkney et al., Resolution Dependent GAN Interpolation for Controllable Image Synthesis Between Domains
- Abdal et al., Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space?