When Cats meet GANs

Sudeep Dasari

Andrew ID: sdasari


Overview

In this project, we build algorithms that generate (grumpy) cat images! Specifically, DCGAN networks are trained to convert random vectors into cat pictures, while CycleGANs are used to convert pictures of the original grumpy cat into those of another grumpy feline friend (or vice-versa).

Part 1: Deep Convolutional GAN

Here we train a DCGAN on a small (~200 image) grumpy cat dataset - shown below. In order to prevent overfitting, data augmentation is added into the loading pipeline.

Real Data Samples

1.1 Solving for Padding

Note that the following relationship holds between the output size \(V\), input size \(W\), kernel size \(K\), padding \(P\), and stride \(S\) of a convolution. $$ V =\frac{W - K + 2P}{S} + 1 $$ Rearranging, setting \(V = 0.5 W\), and plugging in constants reveals \(P=1\) for all layers except the last, which collapses a \(4 \times 4\) feature map to a single score and requires \(P=0\).
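This calculation can be checked with a few lines of Python. The layer shapes assumed here (\(4 \times 4\) kernels, stride 2, a \(64 \times 64\) input halved at each layer, and a final \(4 \to 1\) layer) follow the standard DCGAN discriminator and are an assumption about this project's exact architecture:

```python
def conv_output_size(W, K, P, S):
    """Output size of a convolution: V = (W - K + 2P) / S + 1."""
    return (W - K + 2 * P) // S + 1

def solve_padding(W, K, S, V):
    """Solve V = (W - K + 2P)/S + 1 for the padding P."""
    P2 = S * (V - 1) - W + K
    assert P2 % 2 == 0, "no integer padding achieves this output size"
    return P2 // 2

# Halving layers: 64 -> 32 -> 16 -> 8 -> 4 with K=4, S=2 (assumed shapes).
for W in (64, 32, 16, 8):
    print(W, solve_padding(W, K=4, S=2, V=W // 2))  # P = 1 each time
# Final layer collapses 4 -> 1:
print(4, solve_padding(4, K=4, S=2, V=1))           # P = 0
```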

1.2 Training with and without Augmentation

Loss Default (Blue) vs Augmentation (Orange)
Discriminator FAKE Loss Discriminator REAL Loss Generator Loss

After training DCGAN with and without data augmentation, I plot the training loss over time for each configuration. A well-trained GAN should have loss values fluctuating around 0.1 - 0.9 on average. Both models roughly reach this range, but the model without augmentation (blue) achieves a much lower discriminator loss than the one with augmentation. As a result, its generator is far less capable, as shown by the higher generator training loss. This is a common problem with GANs, where the discriminator becomes too strong and overwhelms the generator.
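The imbalance described above can be illustrated with a least-squares GAN objective (an assumption about the exact objective used here), where a near-perfect discriminator drives its own loss toward zero while the generator's loss saturates:

```python
import numpy as np

# Least-squares GAN losses (a sketch; the exact objective used in the
# project is an assumption). d_real / d_fake are discriminator scores.
def d_loss(d_real, d_fake):
    # Discriminator: push real scores toward 1 and fake scores toward 0.
    return 0.5 * np.mean((d_real - 1) ** 2) + 0.5 * np.mean(d_fake ** 2)

def g_loss(d_fake):
    # Generator: push the discriminator's scores on fakes toward 1.
    return 0.5 * np.mean((d_fake - 1) ** 2)

# A discriminator that perfectly separates real from fake has zero loss,
# while the generator loss sits at its worst value of 0.5:
print(d_loss(np.ones(8), np.zeros(8)))  # 0.0
print(g_loss(np.zeros(8)))              # 0.5
```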

1.3 Results w/ Augmentation

I now train the deluxe model for 10000 iterations and visualize examples from various stages of training. At 200 iterations the results are basically noise, but by 1000 iterations some cat-like "average" features are clearly visible. Around iteration 10000, the images look like grumpy cat faces in various poses. However, obvious image artifacts (graininess, weird proportions, etc.) remain.

It's notable that the fully trained model shows obvious evidence of "mode collapse." The actual data contains pictures of grumpy cat with different colored eyes and in more diverse poses, but the generated samples only copy the most common grumpy cat in a more limited set of poses.

Deluxe Training Results
Iteration 200 Iteration 1000 Iteration 13000

Part 2: CycleGAN

I now train a CycleGAN model to convert images of the OG grumpy cat (domain \(Y\)) into a grey-furred variant (domain \(X\)). Two versions of CycleGAN are trained - one with the full implementation, and one without the cycle-consistency loss. The results are surprisingly similar, but I do notice much better pose correspondence between images generated with the consistency loss. In other words, that loss helps align the cat poses between generations by forcing the generator to map samples back to themselves. Furthermore, the added cycle loss helps the generator make better images, both by better matching the average image pixel values and by removing "blotchy" artifacts.
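The cycle-consistency term can be sketched as an L1 penalty on the reconstruction obtained by mapping \(X \to Y \to X\) (and symmetrically for \(Y\)); the weight of 10 below is an assumed hyperparameter:

```python
import numpy as np

def cycle_loss(x, x_cycled, lam=10.0):
    """L1 cycle-consistency loss: lam * ||G_YtoX(G_XtoY(x)) - x||_1,
    averaged over pixels. lam = 10 is an assumed weight."""
    return lam * np.mean(np.abs(x_cycled - x))

x = np.full((3, 64, 64), 0.25)       # stand-in for a batch-less image
print(cycle_loss(x, x))              # 0.0 -- perfect reconstruction
print(cycle_loss(x, x + 0.1))        # ~1.0 -- every pixel off by 0.1
```

Because the only way to drive this term to zero is for the two generators to invert each other, each generated image must retain enough information (e.g. pose) to reconstruct its source, which is why poses stay aligned.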

W/ Cycle Iter 600
X to Y Y to X
W/ Cycle Iter 10000
X to Y Y to X
No Cycle Iter 600
X to Y Y to X
No Cycle Iter 10000
X to Y Y to X

Bells and Whistles - Patch Discriminator

Finally, I implement the patch discriminator bell and whistle. As the name suggests, the full-image discriminator in CycleGAN is replaced with a discriminator that operates on patches of the image. This can be implemented efficiently via convolution by dropping the last two layers of the network, so that each spatial location of the output scores one patch of the input.

The patch discriminator makes a huge positive difference in generation quality by forcing the generator to correctly match textures between domains while ignoring larger geometry! This results in more realistic cat generations that no longer change face shape in unrealistic ways, as the earlier models did. However, this approach obviously won't work in situations where transforming geometry is desired.
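A minimal sketch of the idea, assuming a 64x64 input and made-up channel widths: truncating the convolutional stack before it collapses to a scalar leaves an output grid, and each cell in that grid is a real/fake score for one receptive-field patch of the input.

```python
import torch
import torch.nn as nn

# PatchGAN-style discriminator sketch (layer widths are assumptions).
# Stopping after three stride-2 convs yields an 8x8 grid of patch scores
# instead of a single image-level score.
patch_D = nn.Sequential(
    nn.Conv2d(3, 32, 4, stride=2, padding=1),   # 64 -> 32
    nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 32 -> 16
    nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, 4, stride=2, padding=1),   # 16 -> 8, one score per patch
)

scores = patch_D(torch.randn(1, 3, 64, 64))
print(scores.shape)  # torch.Size([1, 1, 8, 8])
```

Because each score only sees a local patch, the discriminator can only judge texture statistics, not global face geometry, which is exactly why the generator stops distorting face shapes.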

Patch CycleGAN Iter 10000
X to Y Y to X

Website template graciously stolen from here