16-726 Learning-Based Image Synthesis, 2021 Spring

Project 3: When Cats meet GANs

Teddy Zhang (wentaiz)

Overview

Deep learning based image synthesis has become increasingly popular over the past decade. In particular, Generative Adversarial Networks (GANs) have shown an impressive capability to learn and synthesize new images based on given priors. In this project, the main task is to implement two major GAN frameworks for image synthesis, DCGAN and CycleGAN. The implemented models are verified by generating new cat images.

Deep Convolutional GAN

In this part, we learn the distribution of a set of grumpy cat images so that new cat images can be sampled from the trained generator.

The DCGAN implemented in this project consists of two parts: a DC discriminator and a DC generator. The structures of the given discriminator and generator are shown in the figure below:
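As a concrete reference alongside the figure, here is a minimal PyTorch sketch of a DCGAN-style generator and discriminator for 64x64 images. The channel counts, kernel sizes, and normalization choices are illustrative assumptions, not necessarily the exact configuration used in this project.

```python
import torch
import torch.nn as nn

class DCGenerator(nn.Module):
    """Maps a noise vector z of shape (B, noise_dim, 1, 1) to a 64x64 RGB image."""
    def __init__(self, noise_dim=100, conv_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            # 1x1 -> 4x4
            nn.ConvTranspose2d(noise_dim, conv_dim * 8, 4, 1, 0),
            nn.BatchNorm2d(conv_dim * 8), nn.ReLU(),
            # 4x4 -> 8x8
            nn.ConvTranspose2d(conv_dim * 8, conv_dim * 4, 4, 2, 1),
            nn.BatchNorm2d(conv_dim * 4), nn.ReLU(),
            # 8x8 -> 16x16
            nn.ConvTranspose2d(conv_dim * 4, conv_dim * 2, 4, 2, 1),
            nn.BatchNorm2d(conv_dim * 2), nn.ReLU(),
            # 16x16 -> 32x32
            nn.ConvTranspose2d(conv_dim * 2, conv_dim, 4, 2, 1),
            nn.BatchNorm2d(conv_dim), nn.ReLU(),
            # 32x32 -> 64x64; tanh keeps outputs in [-1, 1]
            nn.ConvTranspose2d(conv_dim, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)

class DCDiscriminator(nn.Module):
    """Maps a 64x64 RGB image to a single real/fake score."""
    def __init__(self, conv_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, conv_dim, 4, 2, 1), nn.ReLU(),      # 64 -> 32
            nn.Conv2d(conv_dim, conv_dim * 2, 4, 2, 1),
            nn.BatchNorm2d(conv_dim * 2), nn.ReLU(),         # 32 -> 16
            nn.Conv2d(conv_dim * 2, conv_dim * 4, 4, 2, 1),
            nn.BatchNorm2d(conv_dim * 4), nn.ReLU(),         # 16 -> 8
            nn.Conv2d(conv_dim * 4, conv_dim * 8, 4, 2, 1),
            nn.BatchNorm2d(conv_dim * 8), nn.ReLU(),         # 8 -> 4
            nn.Conv2d(conv_dim * 8, 1, 4, 1, 0),             # 4 -> 1
        )

    def forward(self, x):
        return self.net(x).squeeze()
```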

Some details of my implementation are:

We trained the above model for 5000 epochs. Here are the training loss curves of this model with two different data augmentation methods (Basic vs Deluxe):

Fig.1 Loss curve for the discriminator. Green: Basic, Gray: Deluxe.
Fig.2 Loss curve for the generator. Green: Basic, Gray: Deluxe.
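The deluxe pipeline is not spelled out on this page, so the sketch below is a plausible reconstruction using torchvision; the resize factor, crop size, and flip probability are assumptions.

```python
import torchvision.transforms as transforms

# Hypothetical "deluxe" augmentation: upscale slightly, then take random
# crops and horizontal flips so the discriminator sees more variety.
deluxe_transform = transforms.Compose([
    transforms.Resize(int(64 * 1.1)),
    transforms.RandomCrop(64),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # map to [-1, 1]
])

# "Basic": deterministic resize and normalization only.
basic_transform = transforms.Compose([
    transforms.Resize(64),
    transforms.CenterCrop(64),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
```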

We can see from the plots above that the two curves in each plot follow roughly the same trend. The discriminator loss generally keeps decreasing with oscillations, while the generator loss rises within the first 400 iterations and then gradually decreases. The model with deluxe data augmentation yields a higher loss for the discriminator and a lower loss for the generator.
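For reference, here is a minimal sketch of how the two plotted losses can be computed in one training step, assuming the least-squares GAN objective (an assumption; this page does not state which objective was used).

```python
import torch

def gan_losses(D, G, real_images, noise):
    """Least-squares GAN losses for one step (assumed objective).

    D: discriminator, G: generator, real_images: a batch of real data,
    noise: a batch of latent vectors.
    """
    # Discriminator: push D(real) toward 1 and D(fake) toward 0.
    fake_images = G(noise).detach()  # block gradients into G
    d_loss = 0.5 * ((D(real_images) - 1) ** 2).mean() \
           + 0.5 * (D(fake_images) ** 2).mean()

    # Generator: push D(G(z)) toward 1, i.e. fool the discriminator.
    g_loss = 0.5 * ((D(G(noise)) - 1) ** 2).mean()
    return d_loss, g_loss
```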

Generated samples from both models after 5000 epochs are also shown in the figure below:

Fig.3a Basic
Fig.3b Deluxe

We can tell that the generated results are visually more refined when the deluxe data augmentation is applied. To better understand the learning process, we also plot the samples generated by the deluxe model at different stages of training:

Fig.4a 400 iterations
Fig.4b 800 iterations
Fig.4c 2000 iterations
Fig.4d 30000 iterations

From Fig.4a-d, we can see that the quality of the generated images keeps improving. Within the first 800 iterations, the network learned the basic color distribution of the cat. The shape of the face and eyes was captured by around 2000 iterations. Finally, it took a much longer training process for the network to learn the fine facial textures.
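Progress grids like Fig.4 are usually produced by decoding a fixed batch of noise vectors at every checkpoint, so that differences across iterations reflect the generator rather than the noise. A short sketch (the helper name and sample count are my own):

```python
import torch
import torchvision.utils as vutils

# Fixed noise reused at every checkpoint so samples stay comparable.
fixed_noise = torch.randn(16, 100, 1, 1)

def save_samples(G, iteration):
    G.eval()
    with torch.no_grad():
        fake = G(fixed_noise)  # (16, 3, 64, 64), values in [-1, 1]
    vutils.save_image(fake, f"sample-{iteration:06d}.png",
                      nrow=4, normalize=True, value_range=(-1, 1))
    G.train()
```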

CycleGAN

In this part, we train a model to perform image-to-image translation between two domains. In the experiments, we use two photo collections: grumpy cats and Russian Blue cats.

The CycleGAN implemented in this project consists of four parts: a generator from domain X to domain Y, a generator from Y to X, a discriminator for X, and a discriminator for Y. The discriminators share the DC discriminator architecture described above. The structure of both generators is shown in the figure below:
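As a concrete reference, here is a minimal sketch of a CycleGAN-style generator with a downsampling encoder, a residual-block bottleneck, and an upsampling decoder; the number of residual blocks, channel widths, and normalization layers are assumptions.

```python
import torch.nn as nn

class ResnetBlock(nn.Module):
    """Residual block that preserves spatial size and channel count."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, 3, 1, 1), nn.InstanceNorm2d(dim), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, 1, 1), nn.InstanceNorm2d(dim),
        )

    def forward(self, x):
        return x + self.conv(x)

class CycleGenerator(nn.Module):
    """Image-to-image generator: encode, transform, decode."""
    def __init__(self, conv_dim=64, n_res_blocks=3):
        super().__init__()
        self.net = nn.Sequential(
            # Encoder: 64x64 -> 16x16
            nn.Conv2d(3, conv_dim, 4, 2, 1),
            nn.InstanceNorm2d(conv_dim), nn.ReLU(),
            nn.Conv2d(conv_dim, conv_dim * 2, 4, 2, 1),
            nn.InstanceNorm2d(conv_dim * 2), nn.ReLU(),
            # Transformation: residual blocks at 16x16
            *[ResnetBlock(conv_dim * 2) for _ in range(n_res_blocks)],
            # Decoder: 16x16 -> 64x64
            nn.ConvTranspose2d(conv_dim * 2, conv_dim, 4, 2, 1),
            nn.InstanceNorm2d(conv_dim), nn.ReLU(),
            nn.ConvTranspose2d(conv_dim, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)
```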

Some details of my implementation are:

We trained the above model with and without the cycle-consistency loss for 600 iterations. Here are comparisons between the generated results of the two models:

Fig.5a w/o cycle loss, X to Y
Fig.5b w/ cycle loss, X to Y
Fig.5c w/o cycle loss, Y to X
Fig.5d w/ cycle loss, Y to X

Then we continued training both models up to 10000 iterations. The final generated samples are shown below:

Fig.6a w/o cycle loss, X to Y
Fig.6b w/ cycle loss, X to Y
Fig.6c w/o cycle loss, Y to X
Fig.6d w/ cycle loss, Y to X

We can tell from Fig.5 that the cats generated from X to Y align better in pose when the cycle-consistency loss is applied. The reason is that the consistency loss encourages a one-to-one mapping between the two domains.
In Fig.6, there is no significant difference between the X-to-Y results of the models with and without cycle loss. However, the model with the consistency loss generates much better results for Y-to-X synthesis. A potential reason is that the Russian Blue collection has larger variation than the grumpy cats and is therefore harder to learn.
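For completeness, the cycle-consistency term penalizes the round trips X -> Y -> X and Y -> X -> Y, which is what ties each output to its specific input. A minimal sketch, assuming an L1 penalty with a weighting factor lam (both assumptions):

```python
def cycle_consistency_loss(G_XtoY, G_YtoX, real_X, real_Y, lam=10.0):
    """L1 cycle-consistency loss for both directions.

    G_XtoY / G_YtoX are the two generators; lam is an assumed weight.
    """
    rec_X = G_YtoX(G_XtoY(real_X))  # X -> Y -> X
    rec_Y = G_XtoY(G_YtoX(real_Y))  # Y -> X -> Y
    return lam * ((rec_X - real_X).abs().mean()
                  + (rec_Y - real_Y).abs().mean())
```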


Bells & Whistles: DCGAN model on Pokemon collection

Here are the results when I used all the Pokemon images as training data.

Fig.7a
Fig.7b

Bells & Whistles: CycleGAN model on Pokemon collection

Here are the results when I applied image-to-image translation between fire-type and water-type Pokemon.

Fig.8a
Fig.8b

Bells & Whistles: GIF videos and New Pokemon

DCGAN:

CycleGAN:
New Pokemon:

DCGAN with water type

DCGAN with all types


Acknowledgement