In this project, we explored GANs. In the first part, we implemented a specific type of GAN designed to process images, called a Deep Convolutional GAN (DCGAN), and trained it to generate some grumpy cats from samples of random noise. In the second part, we implemented a more complex GAN architecture called CycleGAN, which was designed for the task of image-to-image translation.
To calculate the padding (P), we need to know the input size (IS), output size (OS), kernel size (F), and stride (S):
OS = (IS−F+2P)/S+1
(OS-1)*S = IS-F+2P
P = [(OS-1)*S + F - IS] / 2
In our case, F = 4, S = 2, and OS = 1/2 IS for every layer except the last one, where the 4x4 feature map is reduced to 1x1 (OS = 1/4 IS).
Plugging these into the formula gives P = 1 for all of those earlier layers,
and P = 0 for the last layer.
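As a quick sanity check, here is a minimal sketch (assuming PyTorch and a 64x64 RGB input; the channel counts are illustrative, not necessarily the ones used in the assignment) that confirms these paddings give the intended output sizes:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 64, 64)  # dummy 64x64 RGB input (assumed size)

# Downsampling convs with F=4, S=2, P=1 halve the spatial size each time.
down = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1),    # 64 -> 32
    nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),   # 32 -> 16
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),  # 16 -> 8
    nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1), # 8 -> 4
)
# Last layer with F=4, S=2, P=0 maps 4x4 -> 1x1 (OS = IS/4 only because IS = 4 here).
last = nn.Conv2d(256, 1, kernel_size=4, stride=2, padding=0)

print(down(x).shape)        # torch.Size([1, 256, 4, 4])
print(last(down(x)).shape)  # torch.Size([1, 1, 1, 1])
```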
The plot of the basic training curve:
The plot of the deluxe training curve:
Honestly, the curves look quite similar. The fact that both curves drop at the start of training and then start oscillating towards the middle of training indicates that the GAN manages to train.
The plot of deluxe training iteration 200:
The plot of deluxe training iteration 1200:
Obviously, at iteration 200 the generated images are pretty much just noise, while at iteration 1200 we can see the shape of cats. Over the course of training, the generated images become much sharper.
The plot of naive training iteration 200:
The plot of naive training iteration 600:
The plot of naive training iteration 1200:
The plot of Cycle Consistent training iteration 200:
The plot of Cycle Consistent training iteration 600:
The plot of Cycle Consistent training iteration 1200:
The plot of naive training iteration 10000:
The plot of Cycle Consistent training iteration 10000:
We can see that the cycle-consistency loss indeed helps produce visually better translation results. I'm actually surprised to see that even without the cycle-consistency loss, the generated images still preserve the pose; I think that is because our network structure intentionally preserves some spatial information of the images.
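For reference, here is a minimal sketch of the cycle-consistency term as I mean it above. It follows the standard CycleGAN formulation (an L1 penalty on the round-trip reconstructions); the generator names and the lambda value are assumptions for illustration, not the exact code used here:

```python
import torch

def cycle_consistency_loss(real_X, real_Y, G_XtoY, G_YtoX, lambda_cycle=10.0):
    """Penalize the X -> Y -> X and Y -> X -> Y reconstructions with an L1 loss."""
    reconstructed_X = G_YtoX(G_XtoY(real_X))  # X -> Y -> X
    reconstructed_Y = G_XtoY(G_YtoX(real_Y))  # Y -> X -> Y
    loss = torch.mean(torch.abs(reconstructed_X - real_X)) + \
           torch.mean(torch.abs(reconstructed_Y - real_Y))
    return lambda_cycle * loss
```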
I implemented a PatchGAN discriminator and tried using it to train CycleGAN. Obviously we can't use the original paper's receptive field of 70, so I instead set the receptive field to 24 (I think this makes sense on a 64x64 image). Here is the result I get:
The plot of PatchGAN training iteration 1000:
The plot of PatchGAN training iteration 10000:
I think the result is definitely sharper in terms of details. Since we don't have a global discriminator, the global quality of our translation might be slightly worse than our original result (for example, the color of the eyes is different in some of the translations).
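This is not part of my training code, but as a quick way to sanity-check receptive-field choices like the one above, a small helper that walks a conv stack back to front is handy (the example stacks are illustrative, not my exact discriminator):

```python
def receptive_field(layers):
    """Receptive field of a stack of (kernel, stride) conv layers,
    computed back-to-front with RF = (RF - 1) * stride + kernel."""
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf

# The original 70x70 PatchGAN: four 4x4 stride-2/1 convs plus a 1-channel output conv.
print(receptive_field([(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]))  # 70
# A shallower stack of 4x4 stride-2 convs, closer to what fits a 64x64 image.
print(receptive_field([(4, 2), (4, 2), (4, 2)]))                  # 22
```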
Next, I tried a different loss function: WGAN with gradient penalty. I ran it on the vanilla GAN since it is easier to compare performance there. Due to computational constraints I only ran it for 100 epochs, the same as in part 1. The result is as follows:
The plot of WGAN-GP training iteration 1200:
I'm pretty convinced that this loss function made the generator learn faster compared to LSGAN. As a reference, I also tried running it without the gradient penalty, and got the following results:
The plot of WGAN training iteration 1200:
I'm actually surprised that a pure WGAN without weight clipping or gradient penalty would even work. If someone can let me know why that is the case, that would be great!
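For reference, here is a minimal sketch of the gradient penalty term I mean here, following the standard WGAN-GP formulation (the critic interface and the lambda value are assumptions for illustration):

```python
import torch

def gradient_penalty(D, real, fake, lambda_gp=10.0):
    """WGAN-GP penalty: push the critic's gradient norm toward 1
    on random interpolations between real and fake samples."""
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = D(interp)
    grads = torch.autograd.grad(outputs=scores, inputs=interp,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```

The critic then minimizes D(fake).mean() - D(real).mean() plus this penalty, while the generator minimizes -D(fake).mean().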
Lastly, I ran my algorithm on the Pokémon data, translating between the fire and water types. The results are shown below (10000 iterations):
The result doesn't look great, which is likely due to the relatively low number of iterations. I think that in order to get sharper edges and better-looking generations we would want to run for more iterations, but due to time constraints I wasn't able to run it for that long.