When Cats meet GANs

Carnegie Mellon University, 16-726 Learning-Based Image Synthesis, Spring 2021

Null Reaper Logo
Null Reaper (Clive Gomes)

Task

The goal of this project is to get hands-on experience coding and training GANs. Specifically, we designed the neural architecture and trained two kinds of GANs on cat images: DCGANs and CycleGANs. Both of these shall be discussed in the following sections of this report.

Deep Convolutional GAN (DCGAN)

A DCGAN is a GAN that uses a CNN as the discriminator and a series of transposed convolutions as the generator. The purpose of this GAN is to take in a set of random values (or noise) and learn to generate images similar to those in the training dataset. In our case, we have used images of "grumpy" cats as shown below:

Grumpy Cats
Figure 1: Training Data—Images of Grumpy Cats

In order to improve the GAN's output, two different data augmentation pipelines were considered (a code sketch follows the list):

  1. Basic: Only apply a simple normalization operation to shift the values of pixels to the range (-1, 1).
  2. Deluxe: First resize the image to 110% and then randomly crop to the original size. Then, randomly apply horizontal flips to images and normalize to (-1, 1) as in the basic augmentation.
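
The sketch below shows what the "deluxe" pipeline might look like using torchvision; the 64x64 image size and the exact transform ordering are assumptions made for illustration rather than the course starter code.

```python
import torchvision.transforms as T

# Hedged sketch of the "deluxe" augmentation pipeline described above,
# assuming 64x64 training images (sizes and ordering are assumptions).
image_size = 64
deluxe_transform = T.Compose([
    T.Resize(int(1.1 * image_size)),      # resize to ~110% of the target size
    T.RandomCrop(image_size),             # randomly crop back to the original size
    T.RandomHorizontalFlip(p=0.5),        # random horizontal flips
    T.ToTensor(),                         # pixel values in [0, 1]
    T.Normalize(mean=[0.5, 0.5, 0.5],
                std=[0.5, 0.5, 0.5]),     # shift values to the range (-1, 1)
])
```

The "basic" pipeline would keep only the last two transforms.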

Discriminator

We start by designing the discriminator for the DCGAN. A schematic for the CNN used in the project is given below. As can be seen, the network takes a 3-color channel 64x64 pixel image as an input and outputs a single number representing whether it thinks that the image is a real cat (output of 1) or a fake/generated cat (output of 0). The discriminator learns to make this distinction by performing a series of convolution operations with ReLU activation functions.

DCGAN Discriminator Architecture
Figure 2: CNN Architecture for the DCGAN Discriminator

As can be seen, the image is downsampled by a factor of 2 after each convolution operation. In our case, we use a kernel of size 4 and a stride of 2. If we convolve across one row of the 64x64 input image, we get an output of length 1 + (64 - 4)/2 = 31—the "1" is the case where the kernel is placed at the top-left corner of the image, and "(64 - 4)/2" is the number of additional strides possible whilst remaining within the image. To get an output of length 32 (which is half the original size), we need to make one more stride; for a stride length of 2, we use padding=1 (one pixel on each side) to accomplish this. The same holds for all the remaining convolution layers except the "conv5" layer: here, to get a 1x1 output from a 4x4 input, we simply perform the convolution on the 4x4 feature map directly, without any padding (in other words, padding=0 for this layer).
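
A minimal PyTorch sketch of this discriminator is given below. The kernel size, stride, and padding values follow the arithmetic above, while the channel widths (32 → 64 → 128 → 256) are illustrative assumptions rather than values taken from the diagram.

```python
import torch.nn as nn

# Minimal sketch of the DCGAN discriminator described above. Kernel size, stride,
# and padding follow the text; the channel widths are illustrative assumptions.
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1),    # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),   # 32x32 -> 16x16
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),  # 16x16 -> 8x8
            nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1), # 8x8 -> 4x4
            nn.ReLU(),
            nn.Conv2d(256, 1, kernel_size=4, stride=1, padding=0),   # 4x4 -> 1x1 ("conv5", no padding)
        )

    def forward(self, x):
        return self.net(x).squeeze()  # one real/fake score per image
```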

Generator

The Generator architecture is similar to that of the Discriminator described above, except that the layers are in the reverse order and deconvolution (transposed convolution) is performed instead of convolution to upsample the input. The input is a 100-dimensional noise vector (treated as 100 channels of size 1x1), as opposed to the single scalar output of the discriminator. Finally, after the deconv operation in the last layer of the Generator, we apply an activation function; we use tanh here since the regular sigmoid function suffers from the problem of vanishing gradients.

DCGAN Generator Architecture
Figure 3: CNN Architecture for the DCGAN Generator
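
A matching sketch of the generator is given below; it mirrors the discriminator sketch with ConvTranspose2d layers and a final tanh, and again the channel widths are assumptions made for illustration.

```python
import torch.nn as nn

# Minimal sketch of the DCGAN generator: the discriminator layers in reverse, with
# transposed convolutions ("deconv") for upsampling and tanh as the final activation.
# Channel widths mirror the discriminator sketch above and are illustrative assumptions.
class Generator(nn.Module):
    def __init__(self, noise_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(noise_dim, 256, kernel_size=4, stride=1, padding=0),  # 1x1 -> 4x4
            nn.ReLU(),
            nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),        # 4x4 -> 8x8
            nn.ReLU(),
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),         # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),          # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),           # 32x32 -> 64x64
            nn.Tanh(),  # outputs in (-1, 1), matching the normalized training images
        )

    def forward(self, z):
        # z: (batch, noise_dim, 1, 1)
        return self.net(z)
```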

Training Loop

With our Generator and Discriminator ready, we then trained our DCGAN. The training steps are listed below, and a sketch of one training iteration follows the loss equations.

  1. Take m training examples from the dataset.
  2. Take m noise samples from the normal distribution.
  3. Pass the noise samples through the generator network to produce m fake images.
  4. Compute the discriminator loss and update its weights through backpropagation.
  5. Take m new noise samples from the normal distribution and generate fake images as before.
  6. Compute the generator loss and update its weights through backpropagation.

The loss functions for the discriminator and generator are given below. In these equations, x denotes training examples and z, the noise samples.

Discriminator Loss

DCGAN Discriminator Loss Equation

Generator Loss

DCGAN Generator Loss Equation
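
Putting steps 1-6 and the two losses together, one training iteration might look like the sketch below. It assumes a least-squares (mean-squared error) formulation of the losses; D, G, the optimizers, noise_dim, and device are placeholders assumed to be defined elsewhere.

```python
import torch

# Hedged sketch of one DCGAN training iteration (steps 1-6 above), assuming a
# least-squares (mean-squared error) form of the losses. D, G, the optimizers,
# noise_dim, and device are placeholders defined elsewhere.
def train_step(D, G, d_optimizer, g_optimizer, real_images, noise_dim, device):
    m = real_images.size(0)

    # Steps 1-4: update the discriminator on m real and m fake images.
    z = torch.randn(m, noise_dim, 1, 1, device=device)
    fake_images = G(z).detach()                      # detach so G is not updated here
    d_loss = 0.5 * ((D(real_images) - 1) ** 2).mean() + 0.5 * (D(fake_images) ** 2).mean()
    d_optimizer.zero_grad()
    d_loss.backward()
    d_optimizer.step()

    # Steps 5-6: update the generator with a fresh batch of noise.
    z = torch.randn(m, noise_dim, 1, 1, device=device)
    g_loss = ((D(G(z)) - 1) ** 2).mean()
    g_optimizer.zero_grad()
    g_loss.backward()
    g_optimizer.step()

    return d_loss.item(), g_loss.item()
```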

Plots of Training Loss

Discriminator Loss (Basic Augmentation)

Plot of DCGAN Discriminator Loss for Basic Augmentation

Discriminator Loss (Deluxe Augmentation)

Plot of DCGAN Discriminator Loss for Deluxe Augmentation

Generator Loss (Basic Augmentation)

Plot of DCGAN Generator Loss for Basic Augmentation

Generator Loss (Deluxe Augmentation)

Plot of DCGAN Generator Loss for Deluxe Augmentation

In a GAN, we train the discriminator and the generator alternately. At the start of training, both the discriminator and generator losses are essentially arbitrary (due to the random initialization of weights). As the generator gets better at producing cat images that trick the discriminator, the generator loss decreases while the discriminator loss increases. Conversely, as the discriminator gets better at distinguishing real from fake cat images, the discriminator loss decreases while the generator loss increases. Accordingly, each loss curve should decrease for a while, rise back up, then decrease again, and so on. Over time, though, as both the generator and discriminator improve, the peak losses decrease too. Both of these points are in line with the plots obtained above.

If we compare the results for the basic and deluxe augmentations, we see that the loss in the deluxe case decreases much faster than in the basic one; the clearest example of this is the plots of the discriminator's fake loss—there are many more spikes in the plot for the basic case. It would also not be surprising if the minimum loss achieved in the deluxe case were smaller in magnitude than in the basic one. This would be expected since the purpose of the random crops and flips in the deluxe augmentation is, in fact, to make the DCGAN more robust and, consequently, able to achieve a lower loss.

Outputs

Below are example outputs of the DCGAN for the first few iterations of training; data augmentation was set to "deluxe" to generate these results.

DCGAN Output
Figure 4: DCGAN Output Early in Training

As can be seen, the output starts off as noise, but the network slowly learns to generate cats, as evident from the outputs after 400 and 600 iterations—faint outlines of cats are starting to become visible from within the noise. Let's take a look at the output much later in training, after 10k iterations.

DCGAN Output
Figure 5: DCGAN Output Late in Training

The images now look a lot like real cats. There are still some problems—noise at the edges of the images, misaligned eyes on some cats, a bit of blurriness—but the results are still impressive considering that they started off as random noise.

Below is a visual of how the output of the DCGAN evolved over the course of training.

DCGAN Output GIF
Figure 6: Evolution of DCGAN Output during Training on Cats Dataset

Testing on Another Dataset

In addition to the cat images, we also tried to use the DCGAN on images of Fire Pokemon. Examples of the input images are shown below.

Fire Pokemon
Figure 7: Training Data—Images of Fire Pokemon

The outputs after training for around 10k iterations were as follows:

DCGAN Output for Fire Pokemon
Figure 8: DCGAN Output for Fire Pokemon Dataset

Clearly, the outputs are not as good as the cats in the previous section. One obvious reason is that the Fire Pokemon dataset contains images of Pokemon with different shapes, sizes, and colors. Because of this, the DCGAN is not able to learn exactly what a "Fire Pokemon" is—when it optimizes its loss function towards one Pokemon example, the next example is so different that it essentially unlearns what it just did (this can be observed in the GIF below). But even though it doesn't produce good outputs, it is still able to pick up the common characteristics of Fire Pokemon—mainly the colors used (if you compare the outputs to the input images above, you can see that most of the same colors appear). Perhaps if we used a dataset containing Pokemon of more similar shapes, we might get much better results.

DCGAN Pokemon Output GIF
Figure 9: Evolution of DCGAN Output during Training on Pokemon Dataset

CycleGAN

CycleGANs are used for image-to-image translation—e.g., converting a photo taken during the day to one taken at night. In this project, we shall attempt to transform grumpy cats into Russian Blue cats (and vice versa); examples of both are shown below. The discriminator is the same as the one used in the DCGAN earlier, so only the generator architecture is discussed in the following sections.

Training Examples for CycleGAN
Figure 10: Training Data for CycleGAN—Grumpy Cat (left) vs Russian Blue Cat (right)

Generator

The generator architecture consists of an encoder network (a series of convolution layers), 1-3 residual blocks (which perform the domain transfer operation), and a decoder network (a series of deconvolution operations). Each residual block is essentially a convolution layer whose input is added to its output; this is done to ensure that the high-level characteristics of the output aren't too different from those of the input. The architecture diagram for the CycleGAN Generator used in this project is given below, followed by a sketch of a residual block.

Architecture for CycleGAN Generator
Figure 11: Encoder-Decoder Architecture for CycleGAN
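
As a minimal illustration of the residual blocks described above, a block might look like the following; the channel count and kernel size are assumptions, not values taken from the diagram.

```python
import torch.nn as nn

# Minimal sketch of one residual block in the CycleGAN generator: a convolution whose
# input is added back to its output, so the block only has to learn a small change to
# its input. The channel count (64) and kernel size (3) are illustrative assumptions.
class ResnetBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        return x + self.conv(x)  # skip connection keeps the high-level structure of the input
```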

Training Loop

The steps for training the CycleGAN are similar to those of the DCGAN. Nevertheless, the entire sequence of events is listed below; a sketch of the discriminator update (steps 1-5) follows the list.

  1. Take m training examples from domain X (Russian Blue Cats).
  2. Take n training examples from domain Y (Grumpy Cats).
  3. Pass the training examples through the CycleGAN generator to produce translated images.
  4. Compute the discriminator loss on real and fake images.
  5. Update the weights of the X & Y discriminators through backpropagation.
  6. Produce another set of translated images.
  7. Compute the "X to Y" and "Y to X" generator losses and update their weights through backpropagation.
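
A hedged sketch of the discriminator update is given below, again assuming mean-squared losses; D_X, D_Y, G_XtoY, G_YtoX, and the single optimizer over both discriminators are placeholder names rather than the exact ones used in the code.

```python
# Sketch of the CycleGAN discriminator update (steps 1-5 above), assuming mean-squared
# losses. D_X, D_Y, G_XtoY, G_YtoX, the image batches, and d_optimizer (covering the
# parameters of both discriminators) are placeholders defined elsewhere.
def discriminator_step(D_X, D_Y, G_XtoY, G_YtoX, images_X, images_Y, d_optimizer):
    # Loss on real images: each discriminator should output 1 for its own domain.
    d_real_loss = ((D_X(images_X) - 1) ** 2).mean() + ((D_Y(images_Y) - 1) ** 2).mean()

    # Loss on fake (translated) images: each discriminator should output 0.
    fake_X = G_YtoX(images_Y).detach()   # detach so the generators are not updated here
    fake_Y = G_XtoY(images_X).detach()
    d_fake_loss = (D_X(fake_X) ** 2).mean() + (D_Y(fake_Y) ** 2).mean()

    d_loss = d_real_loss + d_fake_loss
    d_optimizer.zero_grad()
    d_loss.backward()
    d_optimizer.step()
    return d_loss.item()
```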

The loss functions for the discriminators and generators are given below.

Discriminator Loss for Real Images

CycleGAN Discriminator Loss Equation for Real Images

Discriminator Loss for Fake Images

CycleGAN Discriminator Loss Equation for Fake Images

Generator Loss for "Y to X" Translation

CycleGAN Generator Loss Equation for "Y to X" Translation

Generator Loss for "X to Y" Translation

CycleGAN Generator Loss Equation for "X to Y" Translation

The one thing in the above equations not mentioned earlier is the cycle-consistency loss, J_cycle. The idea here is that if we convert an image from domain X to Y and then back to domain X, the result should look similar to the input. Accordingly, we compute the L1 norm of the difference between the input image and the result of the cyclic transformation, and then try to minimize this error. The equation for this is given below.

CycleGAN Cycle Consistency Loss Equation

As seen in the generator loss equations earlier, we add this cycle-consistency loss to the mean-squared loss. However, we first multiply the cycle-consistency loss by a factor of lambda, allowing us to select the extent to which input characteristics should be maintained in the output image.
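
For one translation direction, the full generator loss might be assembled as in the sketch below; the mean-squared GAN term and the L1 cycle term follow the equations above, while the function and variable names (G_XtoY, G_YtoX, D_X, lambda_cycle) are illustrative placeholders.

```python
# Hedged sketch of the generator loss for one translation direction (here Y -> X),
# combining the mean-squared GAN term with the L1 cycle-consistency term scaled by
# lambda. G_XtoY, G_YtoX, D_X, and lambda_cycle are illustrative placeholder names.
def y_to_x_generator_loss(G_XtoY, G_YtoX, D_X, images_Y, lambda_cycle=100.0):
    # Translate Y -> X and ask the X-domain discriminator to score it as real.
    fake_X = G_YtoX(images_Y)
    gan_loss = ((D_X(fake_X) - 1) ** 2).mean()

    # Y -> X -> Y round trip: the reconstruction should match the original input (L1 norm).
    reconstructed_Y = G_XtoY(fake_X)
    cycle_loss = (images_Y - reconstructed_Y).abs().mean()

    # lambda_cycle controls how strongly input characteristics are preserved.
    return gan_loss + lambda_cycle * cycle_loss
```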

Outputs

We first ran the CycleGAN for 600 iterations without cycle-consistency loss. The results are shown below:

CycleGAN Output w/o Cycle-Consistency Loss
Figure 12: Output for CycleGAN w/o Cycle-Consistency Loss at 600 Iterations

And here's the output with cycle-consistency loss (using lambda=100):

CycleGAN Output w/ Cycle-Consistency Loss
Figure 13: Output for CycleGAN w/ Cycle-Consistency Loss at 600 Iterations

Let's do a side-by-side comparison to see the difference (only parts of the previous images are taken, so that they can be easily compared).

CycleGAN Output w/ and w/o Cycle-Consistency Loss
Figure 14: W/o Cycle-Consistency Loss (left) vs W/ Cycle-Consistency Loss (right)

As mentioned earlier, the purpose of the cycle-consistency loss is to maintain input image characteristics in the output. The most obvious example of this effect is the bottom-left cat image in the screenshots above. In the output without cycle-consistency loss, the cat appears roundish and its mouth is right below its nose, just like in the grumpy cat images. On the other hand, when cycle-consistency loss is used, the cat's face structure is more like that of the input (Russian Blue) cat image; its mouth is also shifted to match its position in the input image. Additionally, if you look at the other output images on the right, you can see a tint of green in the background (especially in the top-row images). This, although unintended, is due to the fact that the Russian Blue cat images in the top row have a green background.

We used a value of 100 for lambda here but, depending on whether we increase or decrease it, we can control the extent to which the input characteristics are maintained. The reason is that the generator loss function is a sum of the mean-squared loss and the cycle-consistency loss; if we increase lambda, we force the algorithm to focus more on minimizing the cycle-consistency loss in order to reduce the overall loss.

Both models (with and without cycle-consistency loss) were trained once again, but for 10k iterations. The results are below:

CycleGAN Output w/o Cycle-Consistency Loss
Figure 15: Output for CycleGAN w/o Cycle-Consistency Loss at 10K Iterations
CycleGAN Output w/ Cycle-Consistency Loss
Figure 16: Output for CycleGAN w/ Cycle-Consistency Loss at 10K Iterations

Just as at 600 iterations, the cats in the outputs without cycle-consistency loss have round-ish faces like the grumpy cat images, while the ones with cycle-consistency loss conform more to the face structure of the cats in the input images. After 10k iterations, the outputs have certainly become more well-defined; however, there is a bit more noise in the cycle-consistency loss case. Upon checking the training log, it was noticed that the generator loss kept fluctuating within a small range of values (as did the discriminator losses). This means that as the generator tried to get better, the discriminator got worse; but when the discriminator tried to get better again, the increase in the generator loss was so high that the change had to be partially undone. Perhaps the choice of lambda was too high, and better results might be obtained by lowering its value.

We end by providing visuals of the CycleGAN outputs over the course of 10k iterations:

CycleGAN w/o Cycle-Consistency Loss Output GIF CycleGAN w/ Cycle-Consistency Loss Output GIF
Figure 17: Evolution of CycleGAN Output—W/o Cycle-Consistency Loss (top) vs W/ Cycle-Consistency Loss (bottom)—X to Y Domain
CycleGAN w/o Cycle-Consistency Loss Output GIF CycleGAN w/ Cycle-Consistency Loss Output GIF
Figure 18: Evolution of CycleGAN Output—W/o Cycle-Consistency Loss (top) vs W/ Cycle-Consistency Loss (bottom)—Y to X Domain

Bells & Whistles

Get your GAN and/or CycleGAN to work on another dataset

As described in the DCGAN section above, the Fire Pokemon dataset was also used to generate (or attempt to generate) new Pokemon images; refer to the discussion there for further details.

Generate a GIF video or create a meme using your model

For all outputs in this report, GIF videos showing the progression of the GAN output over the course of 10k training iterations have been included.