Introduction

In this assignment, we implement two types of generative adversarial networks (GANs) designed for generating cat and Pokemon images. The first generates samples from purely random noise (a Deep Convolutional GAN, or DCGAN), whereas the second transforms one type of image into another (CycleGAN). We'll train the CycleGAN to convert between two kinds of cats (Grumpy and Russian Blue).

Part 1: Deep Convolutional GAN

For the first part of this assignment, we implement a Deep Convolutional GAN (DCGAN). A DCGAN is simply a GAN that uses a convolutional neural network as the discriminator, and a network composed of transposed convolutions as the generator. To implement the DCGAN, we need to specify three things: 1) the generator, 2) the discriminator, and 3) the training procedure. We will develop each of these three components in the following subsections.

Implement Data Augmentation

Since we have a very small dataset, we need to perform some data augmentation to prevent the discriminator from overfitting. For this purpose I scale up the image by 10%, then apply a random crop back to the original size, a random horizontal flip, and a random rotation of up to 10 degrees. The result is converted to a tensor and normalized for use with the CustomDataset in PyTorch.

    elif opts.data_aug == 'deluxe':
        load_size = int(1.1 * opts.image_size)  # scale up by 10% before cropping
        osize = [load_size, load_size]
        transform = transforms.Compose([
            transforms.Resize(osize, Image.BICUBIC),    # upscale with bicubic interpolation
            transforms.RandomCrop(opts.image_size),     # random crop back to the target size
            transforms.RandomHorizontalFlip(),          # random horizontal flip
            transforms.RandomRotation(10),              # random rotation of up to 10 degrees
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # map pixel values to [-1, 1]
        ])

Implementing the Discriminator of the DCGAN

The discriminator in this DCGAN is a convolutional neural network that has the following architecture:

Padding: In each of the convolutional layers shown above, we downsample the spatial dimension of the input volume by a factor of 2. Given kernel size K = 4, stride S = 2, input width Wi, and output width Wo, the padding P can be computed from the standard formula (Wi - K + 2P)/S + 1 = Wo. With Wi = 64, K = 4, S = 2, and Wo = 32, this gives P = 1.
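
As a quick sanity check of this padding value, here is a minimal sketch of one such downsampling block (the channel counts 3 → 32 are assumptions for illustration, not the exact assignment spec):

    import torch
    import torch.nn as nn

    # One discriminator downsampling block: K=4, S=2, P=1 halves the spatial size,
    # per (Wi - K + 2P)/S + 1 = Wo.  Channel counts here are illustrative only.
    block = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(32),
        nn.ReLU(),
    )

    x = torch.randn(1, 3, 64, 64)   # dummy 64x64 RGB input
    print(block(x).shape)           # torch.Size([1, 32, 32, 32]) -- width 64 -> 32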

Generator

The generator of the DCGAN consists of a sequence of transposed convolutional layers that progressively upsample the input noise sample to generate a fake image. The generator we use in this DCGAN has the following architecture:
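
Alongside the architecture figure, a rough sketch of such an upsampling stack is given below (the noise size of 100 and the channel widths are assumptions for illustration, not the exact assignment architecture):

    import torch
    import torch.nn as nn

    # Illustrative DCGAN-style generator; each K=4, S=2, P=1 transposed conv
    # doubles the spatial size: 4 -> 8 -> 16 -> 32 -> 64.
    generator = nn.Sequential(
        nn.ConvTranspose2d(100, 256, kernel_size=4, stride=1, padding=0),  # 1x1 noise -> 4x4
        nn.BatchNorm2d(256), nn.ReLU(),
        nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),  # 4 -> 8
        nn.BatchNorm2d(128), nn.ReLU(),
        nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),   # 8 -> 16
        nn.BatchNorm2d(64), nn.ReLU(),
        nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),    # 16 -> 32
        nn.BatchNorm2d(32), nn.ReLU(),
        nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),     # 32 -> 64
        nn.Tanh(),  # outputs in [-1, 1], matching the normalized training images
    )

    z = torch.randn(16, 100, 1, 1)   # batch of noise vectors
    print(generator(z).shape)        # torch.Size([16, 3, 64, 64])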

Training Loop

The training loop is implemented as shown in the following pseudocode and is similar to that of a standard GAN. The implementation uses torch.mean, torch.sum, and torch.square to compute the various loss terms.
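
In addition to the pseudocode, here is a minimal sketch of one training iteration, assuming least-squares GAN objectives (which is where torch.mean and torch.square come in); the networks, optimizers, and noise_dim are placeholders assumed to be defined elsewhere:

    import torch

    def gan_training_step(D, G, d_optimizer, g_optimizer, real_images, noise_dim=100):
        """One DCGAN iteration with least-squares losses (sketch only)."""
        batch_size = real_images.size(0)

        # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
        d_optimizer.zero_grad()
        noise = torch.randn(batch_size, noise_dim, 1, 1, device=real_images.device)
        fake_images = G(noise)
        d_loss = 0.5 * (torch.mean(torch.square(D(real_images) - 1))
                        + torch.mean(torch.square(D(fake_images.detach()))))
        d_loss.backward()
        d_optimizer.step()

        # Generator step: push D(fake) toward 1.
        g_optimizer.zero_grad()
        noise = torch.randn(batch_size, noise_dim, 1, 1, device=real_images.device)
        g_loss = torch.mean(torch.square(D(G(noise)) - 1))
        g_loss.backward()
        g_optimizer.step()

        return d_loss.item(), g_loss.item()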

Experiment

  1. The DCGAN can be trained with the command:
    python vanilla_gan.py --num_epochs=100 
    

    The results, including the generator and discriminator losses, can be visualized in TensorBoard. Screenshots of these curves, along with some sample results, are shown below.

    • Screenshots of the discriminator and generator training losses with both --data_aug=basic and --data_aug=deluxe (4 curves in total) are shown below. If the GAN trains successfully, both losses decrease to a certain level and then oscillate around it. This oscillation occurs because the generator and discriminator compete against each other in the GAN's minimax formulation: improving one raises the loss of the other, and vice versa. Training usually still settles around a fixed value because the model finds an equilibrium in the loss landscape. A lot of useful information and suggestions are available here: https://stackoverflow.com/questions/42690721/how-to-interpret-the-discriminators-loss-and-the-generators-loss-in-generative.
    • With deluxe augmentation, here's a set of samples from iteration 200 and iteration 11200.
      What starts off as completely random noise gradually improves into images that closely resemble samples from the training data. The samples improve during training by first capturing the correct colors and broad, high-level features such as edges, and then adding finer details to the cat's face and eyes.

Part 2: CycleGAN

Data Augmentation

The same augmentation pipeline as in the DCGAN is used here as well.

Generator

The generator in the CycleGAN has layers that implement three stages of computation: 1) the first stage encodes the input via a series of convolutional layers that extract the image features; 2) the second stage then transforms the features by passing them through one or more residual blocks; and 3) the third stage decodes the transformed features using a series of transposed convolutional layers, to build an output image of the same size as the input. The residual block used in the transformation stage consists of a convolutional layer, where the input is added to the output of the convolution. This is done so that the characteristics of the output image (e.g., the shapes of objects) do not differ too much from the input.
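
A minimal version of such a residual block might look like the following (the channel count and the use of instance normalization are assumptions for the sketch):

    import torch.nn as nn

    class ResnetBlock(nn.Module):
        """Residual block: the input is added back to the output of the convolution,
        so the transformation stage only learns a small change to the features."""
        def __init__(self, channels=64):  # channel count assumed for illustration
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
                nn.InstanceNorm2d(channels),
                nn.ReLU(),
            )

        def forward(self, x):
            return x + self.conv(x)  # skip connection keeps the output close to the input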

Although there are two generators in the CycleGAN model, corresponding to the two directions X->Y and Y->X, the implementations and architectures are identical. So the two generators are simply different instantiations of the same class.

CycleGAN Training Loop

Finally, we implement the CycleGAN training procedure, pseudocode for which is shown below.

It is similar to Part 1, but with a lot of symmetry in the training procedure, since every operation is performed for both the X → Y and Y → X directions; a rough sketch of the symmetric generator update is given below.
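
To illustrate this symmetry, here is a rough sketch of the adversarial part of the generator update (the names G_XtoY, G_YtoX, D_X, D_Y are placeholders, and least-squares losses are assumed as in Part 1; the cycle-consistency term discussed next would be added on top):

    import torch

    def generator_adversarial_loss(G_XtoY, G_YtoX, D_X, D_Y, images_X, images_Y):
        """Adversarial generator losses for both translation directions (sketch only)."""
        fake_Y = G_XtoY(images_X)   # X -> Y
        fake_X = G_YtoX(images_Y)   # Y -> X

        # Each generator tries to make the corresponding discriminator output 1 on fakes.
        loss_XtoY = torch.mean(torch.square(D_Y(fake_Y) - 1))
        loss_YtoX = torch.mean(torch.square(D_X(fake_X) - 1))

        # The same computation, mirrored across the two directions.
        return loss_XtoY + loss_YtoX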

Cycle Consistency

The most interesting idea behind CycleGANs (and the one from which they get their name) is that of introducing a cycle consistency loss to constrain the model. The idea is that when we translate an image from domain X to domain Y, and then translate the generated image back to domain X, the result should look like the original image that we started with. The cycle consistency component of the loss is the mean squared error between the input images and their reconstructions obtained by passing through both generators in sequence (i.e., from domain X to Y via the X→Y generator, and then from domain Y back to X via the Y→X generator). The cycle consistency loss for the Y→X→Y cycle is expressed as follows:

$$\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - G_{X\to Y}\left(G_{Y\to X}\left(y^{(i)}\right)\right)\right)^{2}$$
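
In code, this term might be computed as follows (a sketch; the generator names and the weighting factor lambda_cycle are assumptions):

    import torch

    def cycle_consistency_loss(G_XtoY, G_YtoX, images_Y, lambda_cycle=10.0):
        """Y -> X -> Y cycle: reconstruct images_Y through both generators and
        penalize the mean squared reconstruction error."""
        reconstructed_Y = G_XtoY(G_YtoX(images_Y))
        return lambda_cycle * torch.mean(torch.square(images_Y - reconstructed_Y))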

CycleGAN Experiments

In this part we first look at the results from training for a short duration, namely 600 iterations.

  1. After training the CycleGAN from scratch without the cycle-consistency loss, we get these results after 600 iterations:


    If we try using cycle-consistency loss, we get the following results:


  2. Here are some results after 10000 iterations of CycleGAN training (the following are without the cycle-consistency loss):


    If we use cycle-consistency loss, the final results look as follows:


    In general, using the cycle consistency loss gives a slight improvement in the results, because it encourages the generator to produce only images that can be translated to the other domain and back while remaining faithful to the original. This can be seen most clearly in the generated images of the Russian Blue cat, which are somewhat better with the cycle consistency loss than without it. The results might improve further by tuning the cycle consistency weight lambda, but I didn't get a chance to try this.

Bells & Whistles

  • I trained the vanilla DCGAN as well as the CycleGAN on the Pokemon dataset. Here are some results!

  • Do something cool with your model: here is a GIF composed of over 400 saved snapshots (one every 200 iterations, so 200 × 400 = 80,000 iterations) of training the GAN to generate Pokemon images (some of the best samples are shown above):

  • Train your GAN to generate higher-resolution images: I did this too! To implement it, I added extra deconv layers to the architecture of the vanilla GAN. Results are shown below for the grumpify cat dataset. I didn't train for very long, which is why the results aren't as good as they could be. A PatchGAN discriminator could also improve the local detail, but I didn't implement that either.
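
    Concretely, the final layers of the generator can be extended with one more transposed convolution (a sketch; the channel counts are assumptions), doubling the output resolution from 64x64 to 128x128:

        import torch
        import torch.nn as nn

        # Each additional K=4, S=2, P=1 transposed conv doubles the output resolution.
        # Channel counts are assumed for illustration.
        extra_upsample = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1),  # 32 -> 64
            nn.BatchNorm2d(16), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, kernel_size=4, stride=2, padding=1),   # 64 -> 128
            nn.Tanh(),
        )

        x = torch.randn(1, 32, 32, 32)   # feature map before the final output layers
        print(extra_upsample(x).shape)   # torch.Size([1, 3, 128, 128])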