Assignment #3 - When Cats meet GANs

by Zijie Li (zijieli@andrew.cmu.edu)

Overview

In this project, we implement two kinds of generative adversarial networks (GANs). The first is DCGAN (Deep Convolutional GAN), which we use to generate realistic cat images from randomly sampled noise. The second is CycleGAN, which we leverage for unpaired image-to-image translation between two categories of cats: Grumpy and Russian Blue.

Generated Cats Preview (From Course Assignment webpage)

Synthesized Grumpy cat

Part 1 Deep Convolutional GAN

Convolutional filter configuration

At each convolutional layer, we aim to downsample the spatial dimension of the input by a factor of 2. Given a kernel size of 4 and a stride of 2, the required padding can be derived from the standard output-size formula [1]: $$N=\frac{W - F + 2P}{S}+1,$$ where \(W\) is the input volume size, \(F\) is the size of the convolutional filter, \(P\) is the amount of padding, and \(N\) is the output volume size. Solving for \(P\): $$P = \frac{S \times N - W + F - S}{2} = \frac{F - S}{2} = \frac{4 - 2}{2} = 1,$$ where the second equality uses \(N = W/2\) and \(S = 2\), so that \(S \times N - W = 0\). The network architecture is implemented following the description in the assignment write-up. Instead of the cross-entropy loss, we adopt a mean-squared-error loss (least-squares GAN) to stabilize the training process. To improve the performance of the model, the following data augmentation techniques are adopted (a code sketch of this configuration and the augmentation pipeline is given after the list):
1. We resize the image so that its height and width are 1.1 times the original size.
2. We randomly crop a patch with exactly the same height and width as the original image.
3. We randomly apply horizontal flipping.
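
As a concrete illustration, here is a minimal PyTorch sketch of a downsampling convolution with the derived padding and of the "deluxe" augmentation pipeline described above. The channel widths, the normalization/activation choice, and the 64x64 image size are assumptions for illustration, not the exact values from the starter code.

    import torch.nn as nn
    import torchvision.transforms as T

    # One downsampling block: kernel 4, stride 2, padding 1 halves the spatial size,
    # as derived above. (Channel widths and the normalization layer are placeholders.)
    downsample = nn.Sequential(
        nn.Conv2d(in_channels=3, out_channels=32, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(32),
        nn.LeakyReLU(0.2),
    )

    # "Deluxe" augmentation: upscale by ~1.1x, randomly crop back to the original
    # size, then randomly flip horizontally (assuming 64x64 training images).
    deluxe_transform = T.Compose([
        T.Resize(int(64 * 1.1)),
        T.RandomCrop(64),
        T.RandomHorizontalFlip(),
        T.ToTensor(),
        T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ])
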
In the following section, I show the loss curves obtained without any data augmentation and with the augmentation strategies above.

Loss curve without data augmentation

Discriminator loss on classifying fake images (D_fake_loss).

Discriminator loss on classifying real images (D_real_loss).

Generator loss on generating images that can fool Discriminator (G_loss).

Generated sample, after 200 iterations

Generated sample, after 4000 iterations

Generated sample, after 26000 iterations

Loss curve with aforementioned data augmentation

Discriminator loss on classifying fake images (D_fake_loss).

Discriminator loss on classifying real images (D_real_loss).

Generator loss on generating images that can fool Discriminator (G_loss).

Generated sample, after 200 iterations

Generated sample, after 4000 iterations

Generated sample, after 26000 iterations

It can be observed from the above results that data augmentation significantly improves the quality of the generated samples: the images generated without data augmentation are much blurrier and noisier. From the perspective of the loss curves, after adding data augmentation the Discriminator losses on both real and generated images increase, while the Generator loss is significantly reduced. The intuitive reason is that data augmentation makes the training distribution more complex, which makes it harder for the Discriminator to tell fake from real and also alleviates overfitting. Without data augmentation, the Discriminator tends to overfit the small dataset after a few iterations and becomes too powerful for the Generator, i.e. the Generator can seldom fool the Discriminator and therefore cannot be updated efficiently.

Part 2 CycleGAN

Generated sample (X->Y), after 600 iterations

Generated sample (Y->X), after 600 iterations

The model structure here differs slightly from the DCGAN above. The CycleGenerator comprises convolutional layers, transposed-convolution (de-convolution) layers, and a stack of residual blocks. In addition, the input to the generator is not randomly sampled noise but an image (i.e. this is a conditional GAN). Concretely, we have two distributions, \(X\) and \(Y\), and we want to train two Generators, \(G_{X \to Y}\) and \(G_{Y \to X}\), where \(G_{X \to Y}\) generates a sample in the \(Y\) domain given an input from \(X\) (and vice versa for \(G_{Y \to X}\)). We also need to train two Discriminators, one per domain, which identify whether an image is real or fake and hence push both Generators to produce higher-quality images.
Here I adopted the same architecture as described in the assignment write-up, but I added a convolutional layer with kernel size 3, stride 1, and padding 1 after the final layer of the CycleGenerator, as I noticed that this improves the Generator's performance a bit.
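
A minimal sketch of the residual block and of the extra final layer is given here, assuming PyTorch; the channel widths and the use of instance normalization are my assumptions rather than the exact assignment specification.

    import torch.nn as nn

    class ResnetBlock(nn.Module):
        """Residual block used inside the CycleGenerator (sketch; exact widths may differ)."""
        def __init__(self, channels=64):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
                nn.InstanceNorm2d(channels),
                nn.ReLU(),
                nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
                nn.InstanceNorm2d(channels),
            )

        def forward(self, x):
            return x + self.conv(x)  # skip connection

    # The extra layer mentioned above: a 3x3, stride-1, padding-1 convolution appended
    # at the very end of the generator, which preserves the spatial resolution.
    final_head = nn.Sequential(
        nn.Conv2d(3, 3, kernel_size=3, stride=1, padding=1),
        nn.Tanh(),
    )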

Without convolutional layer at the last layer

Generated sample (X->Y), after 200 iterations

Generated sample (Y->X), after 200 iterations

Generated sample (X->Y), after 10000 iterations

Generated sample (Y->X), after 10000 iterations

With convolutional layer at the last layer

Generated sample (X->Y), after 200 iterations

Generated sample (Y->X), after 200 iterations

Generated sample (X->Y), after 10000 iterations

Generated sample (Y->X), after 10000 iterations

Cycle Consistency

One point that distinguishes CycleGAN from other GAN variants is the cycle consistency loss. The basic idea is that when we translate an image from one domain to the other and then translate the generated image back, the result should look like the original image we started with. For the \(Y\)-side cycle, the loss can be defined as: $$\frac{1}{m}\sum_{i=1}^m ||y^{(i)} - G_{X\to Y}(G_{Y\to X}(y^{(i)}))||_p,$$ with an analogous term for the \(X\)-side cycle. Here we implement the loss as a mean squared error. Below is a comparison of results from models trained without and with the cycle consistency loss.
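
A minimal sketch of the cycle consistency term (PyTorch, MSE as described above); the generator names and the two-sided formulation are illustrative rather than the exact starter-code interface.

    import torch.nn.functional as F

    def cycle_consistency_loss(real_X, real_Y, G_XtoY, G_YtoX, lambda_cycle=10.0):
        """Mean-squared cycle consistency in both directions (sketch)."""
        rec_X = G_YtoX(G_XtoY(real_X))  # X -> Y -> X reconstruction
        rec_Y = G_XtoY(G_YtoX(real_Y))  # Y -> X -> Y reconstruction
        return lambda_cycle * (F.mse_loss(rec_X, real_X) + F.mse_loss(rec_Y, real_Y))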

Generated samples without using Cycle Consistency

Generated sample (X->Y), after 1000 iterations

Generated sample (Y->X), after 1000 iterations

Generated sample (X->Y), after 2000 iterations

Generated sample (Y->X), after 2000 iterations

Generated sample (X->Y), after 5000 iterations

Generated sample (Y->X), after 5000 iterations

Generated sample (X->Y), after 10000 iterations

Generated sample (Y->X), after 10000 iterations

Generated samples using Cycle Consistency (with \(\lambda=10.0\) )

Generated sample (X->Y), after 1000 iterations

Generated sample (Y->X), after 1000 iterations

Generated sample (X->Y), after 2000 iterations

Generated sample (Y->X), after 2000 iterations

Generated sample (X->Y), after 5000 iterations

Generated sample (Y->X), after 5000 iterations

Generated sample (X->Y), after 10000 iterations

Generated sample (Y->X), after 10000 iterations

Generated samples using Cycle Consistency (with \(\lambda=50.0\) )

Generated sample (X->Y), after 1000 iterations

Generated sample (Y->X), after 1000 iterations

Generated sample (X->Y), after 2000 iterations

Generated sample (Y->X), after 2000 iterations

Generated sample (X->Y), after 5000 iterations

Generated sample (Y->X), after 5000 iterations

Generated sample (X->Y), after 10000 iterations

Generated sample (Y->X), after 10000 iterations

From the above results, we can see that cycle consistency generally improves the quality of the generated samples, especially at the early stage of training (e.g. the samples after 1000 and 2000 iterations). It is also observed that a larger cycle consistency weight \(\lambda\) makes the model perform slightly worse. Without the cycle consistency loss, there are more noticeable checkerboard artifacts and the images are perceptually worse in general.
The motivation for the cycle consistency loss, and the reason it improves performance, is that the plain GAN objective for learning the mapping \(X \to Y\) is highly under-constrained, especially when trained on unpaired data. A network can map the same set of input images to any random permutation of images in the target domain, so the space of possible mappings is effectively infinite. Adding cycle consistency constrains this mapping space and thus makes the generators' outputs better match the desired outputs in the target domain.

Bells and Whistles

1. Differentiable data augmentation

The idea of differentiable data augmentation [2] is straightforward but very effective for improving the model's overall performance. Conventional data augmentation is usually applied only to the real images in the dataset, which prevents the use of augmentations that significantly alter the distribution of the real images. Differentiable augmentation [2] instead applies a set of differentiable augmentations to both the real and the generated images fed to the Discriminator (and keeps them in the Generator's computation graph), which forces the Discriminator to learn the underlying distribution of the training data instead of memorizing it.

Schematic of differentiable augmentation. \(T\) denotes a set of differentiable operators applied to the images. (Image credit: https://github.com/mit-han-lab/data-efficient-gans)
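
A minimal sketch of how the augmentation is applied during training, assuming the DiffAugment helper from the reference implementation [2] and the least-squares losses used in Part 1; the function and variable names are illustrative.

    from DiffAugment_pytorch import DiffAugment  # helper from the repo in [2]

    policy = 'color,translation,cutout'  # augmentation set suggested in the paper

    def d_loss(D, G, real, z):
        # The Discriminator sees augmented versions of BOTH real and fake images.
        fake = G(z).detach()
        d_real = D(DiffAugment(real, policy=policy))
        d_fake = D(DiffAugment(fake, policy=policy))
        return 0.5 * ((d_real - 1) ** 2).mean() + 0.5 * (d_fake ** 2).mean()

    def g_loss(D, G, z):
        # The same differentiable augmentation stays in the Generator's graph,
        # so gradients flow back through T to the Generator.
        d_fake = D(DiffAugment(G(z), policy=policy))
        return 0.5 * ((d_fake - 1) ** 2).mean()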

DCGAN using Deluxe data augmentation

Generated sample, after 5000 iterations

Generated sample, after 10000 iterations

Generated sample, after 26000 iterations

DCGAN using Differentiable data augmentation

Generated sample, after 5000 iterations

Generated sample, after 10000 iterations

Generated sample, after 26000 iterations

CycleGAN using Deluxe data augmentation (the \(Y \to X\) translation is shown here, as the improvement is more noticeable in this direction)

Generated sample, after 5000 iterations

Generated sample, after 10000 iterations

Generated sample, after 50000 iterations

CycleGAN using Differentiable data augmentation

Generated sample, after 5000 iterations

Generated sample, after 10000 iterations

Generated sample, after 50000 iterations

From the above results, differentiable data augmentation improves the quality of the generated samples. The improvement is significant when the training data is very limited.

2. Probabilistic Generative Models

2.1 PixelCNN

The core idea of PixelCNN [3] is to model image generation as a product of per-pixel probability distributions. For a given pixel \(i\), the distribution of its value (its R, G, B values) is conditioned on the pixels already generated, i.e. those in previous rows and those to its left in the same row:

(Source: https://arxiv.org/abs/1601.06759)

To implement this "conditioned on" concept, we can use a masked convolutional filter whose receptive field is restricted to pixels that have already been seen; the weights corresponding to pixels that have not yet been generated are masked to zero. According to the paper, there are two kinds of masks. The first kind, mask A, treats the pixel currently being predicted as unseen, so its weight is masked to zero during the convolution. The second kind, mask B, treats the current pixel as seen, so its weight is not masked.


                     Mask
    -------------------------------------
    |   1      1      1      1      1   |
    |   1      1      1      1      1   |
    |   1      1    1 if B   0      0   |   row H // 2
    |   0      0      0      0      0   |   row H // 2 + 1
    |   0      0      0      0      0   |
    -------------------------------------
        0      1    W//2  W//2+1

Following the description in the original paper, I implement the first convolutional layer with mask type A (kernel size 7, stride 1, padding 3), followed by 15 residual blocks whose inner structure is shown below, and finally two 1x1 convolutional layers. Pixel generation can then be reformulated as a 256-way classification problem (or however many classes the pixel value space has): the model outputs a probability for each pixel value from 0 to 255, and the loss function is the cross-entropy loss.
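
A minimal PyTorch sketch of the masked convolution described above; it uses a single spatial mask per filter and ignores the per-channel R/G/B ordering of the full model, and the channel widths are assumptions.

    import torch
    import torch.nn as nn

    class MaskedConv2d(nn.Conv2d):
        """Convolution whose receptive field only covers already-seen pixels.
        Mask A also zeroes the centre weight; mask B keeps it."""
        def __init__(self, mask_type, *args, **kwargs):
            super().__init__(*args, **kwargs)
            assert mask_type in ('A', 'B')
            _, _, H, W = self.weight.shape
            mask = torch.ones(H, W)
            mask[H // 2, W // 2 + (1 if mask_type == 'B' else 0):] = 0  # right of centre
            mask[H // 2 + 1:, :] = 0                                    # rows below
            self.register_buffer('mask', mask[None, None])

        def forward(self, x):
            self.weight.data *= self.mask  # zero out weights on "future" pixels
            return super().forward(x)

    # First layer: mask A, kernel size 7, stride 1, padding 3, as described above.
    first_layer = MaskedConv2d('A', 3, 128, kernel_size=7, stride=1, padding=3)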

Residual block in pixelCNN (Source: https://arxiv.org/abs/1601.06759)

Pixel Generation result for Cat dataset

At training time the model can be trained on all pixels in parallel, but at sampling time pixels are generated sequentially (a sketch of the sampling loop is shown below). Here I trained the model for 2000 epochs; however, given that the dataset is very small and PixelCNN is a quite complicated model (according to OpenAI's GitHub repo for PixelCNN, training on the CIFAR-10 dataset took over a week on 8 TITAN X GPUs), the model's performance is relatively poor. I tried to find pretrained weights on CIFAR-10 to transfer to the cat dataset, but unfortunately I could not find any open-source pretrained PixelCNN model for PyTorch. To verify the correctness of my implementation, I also trained the model on the binarized MNIST dataset, and this time the results look reasonable. In general, I think PixelCNN is a powerful probabilistic model but very difficult to train given its complexity.
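
A sketch of the sequential sampling loop (PyTorch); the output shape assumed for the model, (batch, 256, channels, height, width), and the [0, 1] pixel scaling are assumptions about my implementation rather than details from the original paper.

    import torch

    @torch.no_grad()
    def sample(model, n, channels=3, height=64, width=64, device='cuda'):
        """Generate pixels one at a time, each conditioned on the pixels already filled in."""
        img = torch.zeros(n, channels, height, width, device=device)
        for i in range(height):
            for j in range(width):
                for c in range(channels):
                    logits = model(img)[:, :, c, i, j]        # (n, 256) class scores
                    probs = torch.softmax(logits, dim=1)
                    pixel = torch.multinomial(probs, 1).squeeze(1)
                    img[:, c, i, j] = pixel.float() / 255.0   # rescale to [0, 1]
        return img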

Some results from PixelCNN

(For the cat dataset, PixelCNN can generate some Grumpy Cat patterns, but in general the images are perceptually distorted.)

Generated sample, after 200 iterations

Generated sample, after 3000 iterations

Generated sample, after 14000 iterations

PixelCNN trained on MNIST, generated sample after 100K iterations, which looks more reasonable

2.2 Variational AutoEncoder (VAE)

A VAE contains an encoder that predicts the latent distribution of \(z\) given an input sample \(x\), and a decoder that maps a code sampled from the latent space back to \(x\). It is considered a probabilistic generative model because the distribution over the latent space is known (given the input), so we can sample in the latent space and then use the decoder to generate outputs. Here I implemented the encoder as a convolutional neural network and the decoder as a transposed-convolution (de-convolution) network. I use the standard VAE loss as the loss function (mean squared reconstruction error plus a prior term based on the KL divergence); a sketch is shown below.
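
A minimal sketch of the loss and of the reparameterization trick (PyTorch); the sum-then-average reduction is my choice for illustration, not necessarily the exact weighting used in my training runs.

    import torch
    import torch.nn.functional as F

    def reparameterize(mu, logvar):
        """Sample z = mu + sigma * eps so the sampling step stays differentiable."""
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + std * eps

    def vae_loss(x, x_recon, mu, logvar):
        """MSE reconstruction term plus KL divergence to the unit-Gaussian prior."""
        recon = F.mse_loss(x_recon, x, reduction='sum') / x.size(0)
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
        return recon + kl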

Some results from VAE

In general the results look reasonable, but they are much blurrier than the DCGAN samples (a common drawback of the vanilla VAE).

Generated sample, after 400 iterations

Generated sample, after 3200 iterations

Generated sample, after 6000 iterations

Generated sample, after 14000 iterations

References

[1] https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-deep-learning

[2] https://arxiv.org/pdf/2006.10738.pdf

[3] https://arxiv.org/abs/1601.06759