11-747 Learning-based Image Synthesis Manuel Rodriguez Ladron de Guevara
This assignment explores GANs on grumpy cats (and Pokemon). First, we learn to generate grumpy cats from noise by implementing a DCGAN model with an L2 loss (LSGAN). In the second part, we implement CycleGAN to learn a mapping between two unpaired sets of images, in our case sets of grumpy and Russian Blue cats. In addition, we explore datasets of different sizes and implement several tricks that improve the quality of our results.
Deep Convolutional GAN (DCGAN) was introduced by Radford et al. and uses a convolutional neural network as the discriminator and transposed convolutions in the generator. Specifically, we use the following architecture:
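As a rough stand-in for the architecture diagram, here is a minimal PyTorch sketch of a DCGAN-style generator and discriminator, assuming 64x64 outputs, a 100-dimensional noise vector, and illustrative filter counts (these numbers are assumptions, not necessarily the graded architecture):

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a noise vector to a 64x64 RGB image via transposed convolutions."""
    def __init__(self, z_dim=100, ngf=64):
        super().__init__()
        self.net = nn.Sequential(
            # z_dim x 1 x 1 -> (ngf*8) x 4 x 4
            nn.ConvTranspose2d(z_dim, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
            # -> (ngf*4) x 8 x 8
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
            # -> (ngf*2) x 16 x 16
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
            # -> ngf x 32 x 32
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf), nn.ReLU(True),
            # -> 3 x 64 x 64; tanh keeps pixels in [-1, 1]
            nn.ConvTranspose2d(ngf, 3, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))

class Discriminator(nn.Module):
    """Mirror of the generator: strided convs from 64x64 down to one score."""
    def __init__(self, ndf=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ndf, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1), nn.BatchNorm2d(ndf * 2), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1), nn.BatchNorm2d(ndf * 4), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1), nn.BatchNorm2d(ndf * 8), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ndf * 8, 1, 4, 1, 0),  # raw score; LSGAN uses it without sigmoid
        )

    def forward(self, x):
        return self.net(x).view(-1)
```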
We train our DCGAN architecture using the L2 (least-squares) loss, which stabilizes training compared to the original GAN loss. We implement the training following the pseudocode below:
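The pseudocode boils down to alternating least-squares updates. Here is a minimal sketch of one training iteration, assuming generator G, discriminator D, and their optimizers are already built (names are mine, not the assignment's):

```python
import torch
import torch.nn.functional as F

def train_step(G, D, g_opt, d_opt, real, z_dim=100):
    """One LSGAN iteration: L2 loss pulls D's scores toward 1 (real) / 0 (fake)."""
    device = real.device
    b = real.size(0)

    # --- Discriminator update ---
    d_opt.zero_grad()
    z = torch.randn(b, z_dim, device=device)
    fake = G(z).detach()  # block gradients from flowing into G
    d_loss = 0.5 * (F.mse_loss(D(real), torch.ones(b, device=device))
                    + F.mse_loss(D(fake), torch.zeros(b, device=device)))
    d_loss.backward()
    d_opt.step()

    # --- Generator update: push D's score on fakes toward the "real" target ---
    g_opt.zero_grad()
    z = torch.randn(b, z_dim, device=device)
    g_loss = F.mse_loss(D(G(z)), torch.ones(b, device=device))
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```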
We run the training with two augmentation settings: basic (resizing and normalization) and deluxe (resizing, random crop, random horizontal flip, and normalization). We contrast these baseline results with additions such as spectral normalization, a patch discriminator, and differentiable data augmentation.
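For reference, the two pipelines can be expressed with torchvision as below; the 64x64 training resolution and the 70-pixel pre-crop size are my assumptions:

```python
from torchvision import transforms

# Basic: just resize and map pixels to [-1, 1] to match the generator's tanh.
basic = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),
])

# Deluxe: resize slightly larger, then random crop and flip for variety.
deluxe = transforms.Compose([
    transforms.Resize((70, 70)),
    transforms.RandomCrop(64),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),
])
```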
The next two images show the DCGAN trained with minimal settings: just 500 epochs, batch size 16, learning rate 0.0003, under basic and deluxe data augmentation.
At this early stage, basic augmentation shows better results than deluxe augmentation; however, the discriminator and generator losses appear to diverge under basic augmentation, while under deluxe augmentation they move closer together. To ensure some learning signal for both G and D during training, the generator loss should stay roughly between 0.5 and 2 (sometimes higher, depending on the loss function), and the discriminator loss should not reach 0, since otherwise the generator receives no feedback to improve.
The Pokemon dataset is considerably harder to train on than the cats, due to the diversity of Pokemon shapes and colors. I achieved decent results after extensive exploration and many training runs, using the tricks described below.
After much effort, I stabilized training and improved output quality with the following hyperparameters: batch size 64, generator learning rate 0.0001, discriminator learning rate 0.0004, one-sided label smoothing uniformly sampled between 0.8 and 0.9, spectral normalization in the convolutional layers, deluxe augmentation, and a two-stage differentiable augmentation schedule: for the first 60000 iterations I used only cutout, and for the following 80000 iterations I used cutout and translation.
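Two of these tricks in code form. This is a sketch rather than the exact training script: the smoothing range comes from above, the DiffAugment call assumes the DiffAugment_pytorch module from the data-efficient-gans repository, and the stage boundary mirrors the schedule just described:

```python
import torch
import torch.nn.functional as F
from DiffAugment_pytorch import DiffAugment  # from the data-efficient-gans repo

def d_loss_smoothed(D, real, fake, iteration):
    """LSGAN discriminator loss with one-sided label smoothing and staged DiffAugment."""
    # Stage 1 (first 60k iterations): cutout only; stage 2: cutout + translation.
    policy = 'cutout' if iteration < 60_000 else 'cutout,translation'
    real, fake = DiffAugment(real, policy=policy), DiffAugment(fake, policy=policy)

    # One-sided smoothing: real targets drawn uniformly from [0.8, 0.9];
    # fake targets stay at exactly 0.
    b = real.size(0)
    real_target = torch.empty(b, device=real.device).uniform_(0.8, 0.9)
    fake_target = torch.zeros(b, device=real.device)
    return 0.5 * (F.mse_loss(D(real), real_target) + F.mse_loss(D(fake), fake_target))
```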
Once I got the Pokemon dataset to work, I added a couple of layers to the generator and discriminator to handle the new image size, cats at 256x256, keeping the same hyperparameters as for the Pokemon dataset. One big difference I noticed at the higher resolution is the loss fluctuation. This is where spectral normalization comes to the rescue: there is quite a difference between using it and not using it!
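In PyTorch, spectral normalization is a one-line wrapper; a sketch of how each discriminator convolution can be wrapped (the layer sizes are illustrative):

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Wrapping each conv constrains its spectral norm to ~1, which bounds the
# discriminator's Lipschitz constant and damps the loss fluctuations at 256x256.
def sn_conv(in_ch, out_ch):
    return spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1))
```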
CycleGAN was introduced by Zhu et al. as a more flexible approach to image-to-image translation, building on the well-known pix2pix work by Isola et al. While pix2pix needs paired inputs, that is, each image must be paired with its corresponding translation for the model to generalize to unseen images, CycleGAN does not need labelled data (paired images). It achieves translation between two sets using a cycle-consistency loss, which encourages the translation from A to B and back to A to be as close as possible to the original input A. Instead of taking noise z as input, the generator takes images from one of the sets, encodes them through a series of convolutional layers and residual blocks, and decodes with standard transposed convolutions. The discriminator can be a regular DCGAN discriminator. However, we show that a PatchGAN discriminator, originally introduced in the pix2pix model, generates much better results than the standard DCGAN discriminator.
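A minimal sketch of the cycle-consistency term, assuming generators G_XtoY and G_YtoX, an L1 reconstruction penalty, and a hypothetical weight lambda_cycle:

```python
import torch.nn.functional as F

def cycle_consistency_loss(G_XtoY, G_YtoX, x, y, lambda_cycle=10.0):
    """Penalize X -> Y -> X (and Y -> X -> Y) for drifting from the input."""
    loss_x = F.l1_loss(G_YtoX(G_XtoY(x)), x)  # forward cycle
    loss_y = F.l1_loss(G_XtoY(G_YtoX(y)), y)  # backward cycle
    return lambda_cycle * (loss_x + loss_y)
```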
We train CycleGAN with the same L2 (least-squares) GAN loss used for DCGAN, again implementing the training following the pseudocode below:
We run a series of baselines: basic versus deluxe augmentation, training with and without the cycle-consistency loss, and a standard DCGAN discriminator versus a patch discriminator (sketched below).
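The patch discriminator differs from the DCGAN one only at the output: rather than a single score per image, it emits a grid of scores, each judging a local patch, and the L2 GAN loss is applied to every cell. A sketch, with filter counts as assumptions:

```python
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Outputs an NxN grid of real/fake scores, one per receptive-field patch."""
    def __init__(self, ndf=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ndf, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1), nn.InstanceNorm2d(ndf * 2), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1), nn.InstanceNorm2d(ndf * 4), nn.LeakyReLU(0.2, True),
            # No final pooling/flatten: every output cell is scored independently.
            nn.Conv2d(ndf * 4, 1, 4, 1, 1),
        )

    def forward(self, x):
        return self.net(x)  # shape (B, 1, N, N)
```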
We show some results from both generators (X to Y and Y to X) on the cats dataset:
We see that the translation from Blue to Grumpy is better. This is due to the size difference between the two datasets: while Grumpy has 205 images, Blue has only 76, which naturally makes generation harder in the latter direction.
The major improvement here is in the Grumpy to Blue direction. While the generated Grumpy cats do not improve significantly, the cycle-consistency loss greatly improves the generated Blue cats. However, the results are still far from desirable. To alleviate this, let's look at the results with deluxe data augmentation.
We see similar results to baseline 2: the Grumpy cats do not improve greatly, but the data augmentation really pays off for the Blue cats.
In this last baseline, we see how the cycle-consistency loss greatly improves the generated Grumpy cats, making this the best baseline so far. On the other hand, the Blue cats sadly not only fail to improve over baseline 3, they get worse.
The PatchGAN discriminator makes the difference! We finally see good results for the Grumpy cats and the best Blue cats generated so far. The PatchGAN discriminator removes some yellow artifacts we previously saw around the eyes of the Blue cats.
The cycle loss again does not add much for either the Grumpy or the Blue cats. Arguably, the Grumpy cats are of similar quality, and the Blue cats are definitely worse.