For this project, I focused on exploring conditional GANs. In the first part, I reimplemented "cGANs with Projection Discriminator" and ran it on a few datasets of different resolutions; please see my code submission for implementation details. In the second part, I used the reimplemented network to achieve image-to-image translation, showed success and failure cases, and proposed potential solutions to polish the failure cases.
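The key idea of the projection discriminator is that the class-conditional part of the discriminator's output is an inner product between a class embedding and the image features. Below is a minimal sketch of that output head; the feature extractor and layer sizes are placeholders for illustration, not my exact architecture (a real model uses ResNet blocks with spectral normalization):

```python
import torch
import torch.nn as nn

class ProjectionDiscriminator(nn.Module):
    """Sketch of a projection discriminator head:
    D(x, y) = psi(phi(x)) + <embed(y), phi(x)>."""
    def __init__(self, num_classes, feat_dim=128):
        super().__init__()
        # placeholder feature extractor phi; illustrative only
        self.phi = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.psi = nn.Linear(feat_dim, 1)                 # unconditional term
        self.embed = nn.Embedding(num_classes, feat_dim)  # class-conditional term

    def forward(self, x, y):
        h = self.phi(x)  # image features, shape (batch, feat_dim)
        # unconditional logit + projection onto the class embedding
        return self.psi(h) + (self.embed(y) * h).sum(dim=1, keepdim=True)

D = ProjectionDiscriminator(num_classes=10)
out = D(torch.randn(4, 3, 32, 32), torch.tensor([0, 1, 2, 3]))
print(out.shape)  # torch.Size([4, 1])
```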
First, I trained on the 32×32 CIFAR-10 dataset. I show results for a few different labels below.
Planes
Cars
Birds
Cats
We can see that the network was able to produce images of pretty decent quality.
Next, I ran it on the 64×64 Stanford Dogs dataset and got the following results (species 1 - 6):
Another thing that is extremely interesting is that we can actually blend between two classes by averaging the conditional batch normalization values of the corresponding classes. Here I show two blending results:
Blending species 1 with species 2:
Blending species 3 with species 4:
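The blending above works because the generator injects class information through conditional batch normalization, whose per-class gain and bias vectors can simply be averaged. A sketch of the idea (the layer here is a simplified stand-in, not my exact implementation):

```python
import torch
import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    """Sketch of conditional batch norm: per-class gain/bias embeddings
    modulate a parameter-free BatchNorm. Blending two classes amounts
    to averaging their gain/bias vectors."""
    def __init__(self, num_features, num_classes):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.gain = nn.Embedding(num_classes, num_features)
        self.bias = nn.Embedding(num_classes, num_features)
        nn.init.ones_(self.gain.weight)
        nn.init.zeros_(self.bias.weight)

    def forward(self, x, y):
        g = self.gain(y).unsqueeze(-1).unsqueeze(-1)
        b = self.bias(y).unsqueeze(-1).unsqueeze(-1)
        return g * self.bn(x) + b

    def forward_blend(self, x, y1, y2, alpha=0.5):
        # average the conditional gains/biases of the two classes
        g = alpha * self.gain(y1) + (1 - alpha) * self.gain(y2)
        b = alpha * self.bias(y1) + (1 - alpha) * self.bias(y2)
        return (g.unsqueeze(-1).unsqueeze(-1) * self.bn(x)
                + b.unsqueeze(-1).unsqueeze(-1))

cbn = ConditionalBatchNorm2d(64, num_classes=120)
x = torch.randn(2, 64, 8, 8)
y1, y2 = torch.tensor([0, 0]), torch.tensor([1, 1])
blended = cbn.forward_blend(x, y1, y2)
print(blended.shape)  # torch.Size([2, 64, 8, 8])
```

Varying `alpha` between 0 and 1 would sweep continuously from one species to the other.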
I also found that the network struggles to train on the Tiny ImageNet dataset, producing results like this:
This is likely because the training set is too small (500 images per class), and it shows that the algorithm cannot really handle a lack of training data.
One interesting thing I found is that if we sample with the same latent code "z", the images we get for different labels share a similar structure, as shown below:
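Generating such a grid just means reusing one latent code across all class labels. A sketch of the sampling loop, with a stub generator standing in for the trained network (the interface `G(z, y)` is the only assumption):

```python
import torch
import torch.nn as nn

class StubGenerator(nn.Module):
    """Stand-in for a trained conditional generator with interface G(z, y)."""
    def __init__(self, z_dim=128, num_classes=10):
        super().__init__()
        self.embed = nn.Embedding(num_classes, z_dim)
        self.fc = nn.Linear(z_dim * 2, 3 * 32 * 32)

    def forward(self, z, y):
        h = torch.cat([z, self.embed(y)], dim=1)
        return torch.tanh(self.fc(h)).view(-1, 3, 32, 32)

G = StubGenerator()
z = torch.randn(1, 128)      # one fixed latent code
z = z.expand(10, -1)         # reuse it for all 10 labels
labels = torch.arange(10)
imgs = G(z, labels)          # same z, different class conditions
print(imgs.shape)  # torch.Size([10, 3, 32, 32])
```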
Therefore, I came up with the idea that we can simply take an image & label pair, optimize for its z value (similar to HW 5), and then change the label to achieve unsupervised image-to-image translation. I implemented this idea and got mixed results.
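The translation procedure can be sketched as follows: optimize z by gradient descent so that the generator reconstructs the query image under its source label, then re-render with the target label. This is a simplified version under the assumption of a pixel-wise MSE loss (a real projection would also use a perceptual loss); `StubGenerator` is a hypothetical stand-in for the trained network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def project_and_translate(G, x_target, y_src, y_tgt,
                          z_dim=128, steps=500, lr=0.05):
    """Optimize z so G(z, y_src) reconstructs x_target, then
    re-render the same z under the target label y_tgt."""
    z = torch.randn(1, z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(G(z, y_src), x_target)  # reconstruction loss
        loss.backward()
        opt.step()
    with torch.no_grad():
        recon = G(z, y_src)       # reconstruction under the source label
        translated = G(z, y_tgt)  # same z, new label -> translation
    return recon, translated

# smoke test with a stub generator (hypothetical, for illustration only)
class StubGenerator(nn.Module):
    def __init__(self, z_dim=128, num_classes=120):
        super().__init__()
        self.embed = nn.Embedding(num_classes, z_dim)
        self.fc = nn.Linear(z_dim * 2, 3 * 64 * 64)

    def forward(self, z, y):
        h = torch.cat([z, self.embed(y)], dim=1)
        return torch.tanh(self.fc(h)).view(-1, 3, 64, 64)

G = StubGenerator()
x = torch.rand(1, 3, 64, 64) * 2 - 1
recon, translated = project_and_translate(
    G, x, torch.tensor([3]), torch.tensor([7]), steps=20)
```

Because the optimization is purely gradient-based, the quality of the translation is bounded by the quality of the reconstruction, which is exactly where the failure case below comes from.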
Queried Image:
Translated Images:
Queried Image:
Translated Images:
What happened in the failure case? It turns out the problem is with the reconstruction:
Reconstructed Image:
We can clearly see that while the reconstruction gets the position of the dog right, it fails to place the mouth and nose correctly, and as a result produces low-quality images. In fact, as pointed out in "Transforming and Projecting Images into Class-conditional Generative Networks" by Huh et al., "Purely gradient-based optimizations fail to find good solutions for projection with conditional GAN". I believe that to polish the reconstruction, special care needs to be taken with the projection steps. This is beyond the scope of my project, but something I believe future work could look into.
Lastly, I'd like to thank Professor Zhu and the TAs for the amazing lectures and supportive feedback, through which I learned a lot. All the best!