*CMU 16-726 Spring 2021 Assignment #3*
Name: Juyong Kim
![Loss and sample generation of DCGAN over training](images/vanillagan/grumpifyBprocessed_deluxe_bceloss.gif)
(#) Introduction
In this assignment, we will implement two famous GAN (generative adversarial network) models and train them on a real image dataset.
In the first part, we will implement a specific type of GAN designed to process images, called a Deep Convolutional GAN (DCGAN), to generate some grumpy cats from noise.
In the second part, we will implement CycleGAN, a more complex GAN architecture for image-to-image translation, to convert between two kinds of cats.
(#) Part 1: Deep Convolutional GAN
The Deep Convolutional GAN (DCGAN) [#Radford16] is a specific type of GAN that uses convolutional (`conv`) and de-convolutional (`deconv`) layers.
The objective function of the original GAN is defined as
$$ \mathcal{L}_\text{GAN}(G, D) = \mathbb{E}_x[\log D(x)] + \mathbb{E}_{z}[\log (1-D(G(z)))], $$
where $G$ and $D$ are the generator and the discriminator, $x$ is a real image, and $z$ is the noise vector from which the generated image is made.
Training a GAN is a minimax optimization in which $D$ maximizes this objective while $G$ minimizes it.
In the DCGAN, $G$ is defined as a deconv net that generates images and $D$ as a conv net that discriminates them.
To implement the DCGAN, we need to specify three things: 1) the generator, 2) the discriminator, and 3) the training procedure.
In the assignment, the skeleton of the DCGAN framework is already provided, and we only need to implement the core functionality behind the predefined function signatures, as described below.
(##) Data augmentation
The sampled real images need to be augmented because the discriminator can easily overfit to them otherwise.
We therefore use data augmentation to provide a more diverse distribution of real images.
In this assignment, we use transformations such as resizing, random cropping, and horizontal flipping to define a *deluxe* data augmentation option.
We also linearly rescale the pixel values to $[-1, 1]$ so that they are easier for the generator to produce.
In PyTorch, such image transformations are already implemented in the `torchvision.transforms` package.
All we need to do is specify the transformations in `data_loader.py` and pass them to `CustomDataset`, which is already provided to incorporate the transformations into the dataset pipeline.
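As a concrete illustration, a minimal sketch of what the `deluxe` transform could look like is shown below; the pre-crop size (70) is an assumption, and the actual values come from the assignment options.

```python
import torchvision.transforms as transforms

# Possible "deluxe" augmentation pipeline (the pre-crop size 70 is an assumption).
deluxe_transform = transforms.Compose([
    transforms.Resize(70),                  # resize slightly larger than 64x64
    transforms.RandomCrop(64),              # random 64x64 crop
    transforms.RandomHorizontalFlip(),      # flip left-right with probability 0.5
    transforms.ToTensor(),                  # PIL image -> float tensor in [0, 1]
    transforms.Normalize((0.5, 0.5, 0.5),   # linearly rescale [0, 1] -> [-1, 1]
                         (0.5, 0.5, 0.5)),
])
```

The resulting transform is then passed to `CustomDataset` through the data loader.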
(##) The discriminator of the DCGAN
The discriminator is a neural network that takes in an input image of size $64 \times 64$, and classifies whether the image is real or fake.
The network consists of five `conv` layers, accompanied by instance normalization and ReLU layers.
In PyTorch, a network is a class that inherits from `nn.Module`, and we implement the initialization method (`__init__()`) and the `forward()` method of the `DCDiscriminator` class in `models.py`.
We can also use the `conv()` helper provided in the assignment to easily build the PyTorch layers.
The tensor size at each layer and the kernel size and stride of the conv layers are given by the assignment, so we only need to choose the padding width that makes the tensor shapes match.
The output length (width or height) is determined by
$$ l_\text{out} = \left\lfloor \frac{l_\text{in} + 2p - k}{s} \right\rfloor + 1, $$
where $k$, $s$, and $p$ are the kernel size, the stride, and the padding width, respectively.
Using $k = 4$ and $s = 2$ which are fixed for the assignment, the padding widths are computed as follows:
Layer | $l_\text{in}$ | $l_\text{out}$ | $p$
-------|---------------|----------------|----
`conv1`| 64 | 32 | 1
`conv2`| 32 | 16 | 1
`conv3`| 16 | 8 | 1
`conv4`| 8 | 4 | 1
`conv5`| 4 | 1 | 0
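To double-check the table, here is a minimal sketch of the discriminator written directly with `nn.Conv2d`; the channel widths (`conv_dim` and its multiples) are assumptions, and the assignment's `conv()` helper wraps the same layers more compactly.

```python
import torch.nn as nn

class DCDiscriminatorSketch(nn.Module):
    """Minimal 5-layer discriminator sketch; channel widths are assumptions."""
    def __init__(self, conv_dim=32):
        super().__init__()
        def block(c_in, c_out, k=4, s=2, p=1):
            return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, p),
                                 nn.InstanceNorm2d(c_out),
                                 nn.ReLU(inplace=True))
        self.conv1 = block(3, conv_dim)                 # 64 -> 32, p = 1
        self.conv2 = block(conv_dim, conv_dim * 2)      # 32 -> 16, p = 1
        self.conv3 = block(conv_dim * 2, conv_dim * 4)  # 16 -> 8,  p = 1
        self.conv4 = block(conv_dim * 4, conv_dim * 8)  #  8 -> 4,  p = 1
        # last layer: 4 -> 1 with p = 0, producing a single real/fake score
        self.conv5 = nn.Conv2d(conv_dim * 8, 1, kernel_size=4, stride=2, padding=0)

    def forward(self, x):
        out = self.conv4(self.conv3(self.conv2(self.conv1(x))))
        return self.conv5(out).squeeze()  # (N, 1, 1, 1) -> (N,)
```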
(##) The generator of the DCGAN
The generator consists of a sequence of transpose convolutional layers that progressively upsample the input noise sample to generate a fake image.
The tensor sizes here are also defined by the assignment, and similar work is done for the `DCGenerator` class using the provided `deconv()` helper.
Because the tensor shapes come in the reverse order of those in the discriminator, we can reuse the same padding values.
At the last layer, we can apply a $\tanh$ activation to constrain the pixel values to the range $[-1, 1]$, or output the conv result as it is.
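Correspondingly, a minimal sketch of the generator built from `nn.ConvTranspose2d` is shown below; the assignment's `deconv()` helper plays the same role, and the channel widths and noise size are assumptions.

```python
import torch
import torch.nn as nn

class DCGeneratorSketch(nn.Module):
    """Minimal generator sketch that mirrors the discriminator's shapes."""
    def __init__(self, noise_size=100, conv_dim=32):
        super().__init__()
        def up(c_in, c_out, k=4, s=2, p=1):
            return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, k, s, p),
                                 nn.InstanceNorm2d(c_out),
                                 nn.ReLU(inplace=True))
        self.deconv1 = up(noise_size, conv_dim * 8, p=0)  #  1 -> 4,  p = 0
        self.deconv2 = up(conv_dim * 8, conv_dim * 4)     #  4 -> 8,  p = 1
        self.deconv3 = up(conv_dim * 4, conv_dim * 2)     #  8 -> 16, p = 1
        self.deconv4 = up(conv_dim * 2, conv_dim)         # 16 -> 32, p = 1
        self.deconv5 = nn.ConvTranspose2d(conv_dim, 3, 4, 2, 1)  # 32 -> 64

    def forward(self, z):                     # z: (N, noise_size, 1, 1)
        out = self.deconv4(self.deconv3(self.deconv2(self.deconv1(z))))
        return torch.tanh(self.deconv5(out))  # pixel values in [-1, 1]
```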
(##) Loss and Training Loop
In the training of GAN, we alternate the gradient update steps of the discriminator and the generator.
As suggested in the assignment, we perform one SGD step of the discriminator for every two SGD steps of the generator.
The choice of loss function depends on whether we use a sigmoid activation after the last layer of the discriminator.
We tried the $\ell_2$ loss when there is no activation at the last discriminator layer (as the assignment suggests), and the binary cross-entropy (BCE) loss when the sigmoid activation is used.
One trick we used for the BCE loss is to minimize $-\log D(G(z))$ instead of minimizing $\log (1-D(G(z)))$.
This objective for $G$ was introduced in the original GAN paper [#GoodFellow14], and it provides much stronger gradients early in training.
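To make the two variants concrete, the sketch below computes the BCE version of the losses for one iteration, including the non-saturating generator objective; `D`, `G`, and the function name are placeholders rather than the assignment's actual identifiers, and `D` is assumed to end with a sigmoid.

```python
import torch
import torch.nn.functional as F

def gan_bce_losses(D, G, real_images, noise):
    """BCE losses for one iteration; D is assumed to output probabilities."""
    fake_images = G(noise)

    # Discriminator: push real images toward 1 and fake images toward 0.
    d_real = D(real_images)
    d_fake = D(fake_images.detach())          # detach so G gets no gradient here
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))

    # Generator (non-saturating trick): minimize -log D(G(z)),
    # i.e. label the fake images as real.
    d_fake_for_g = D(fake_images)
    g_loss = F.binary_cross_entropy(d_fake_for_g, torch.ones_like(d_fake_for_g))
    return d_loss, g_loss
```

For the $\ell_2$ variant without the sigmoid, the same structure applies with a mean-squared-error term in place of each BCE term.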
(#) Part 2: CycleGAN
The second part of the assignment is to fill the missing parts in the CycleGAN [#Zhu17] code.
In CycleGAN, instead of generating images from random noise, we convert images of two domains, $X$ and $Y$, into each other's domain.
To do that we have two generators, $G_{X\rightarrow Y}$ and $G_{Y\rightarrow X}$, and two discriminators, $D_X$ and $D_Y$.
As the model has more components, we need additional loss terms and optimization techniques to train it.
(##) Model architecture
The generators $G_{X\rightarrow Y}$ and $G_{Y\rightarrow X}$ use a bottleneck architecture consisting of `conv`, residual block (`ResnetBlock`), and `deconv` layers.
The specific hyper-parameters, such as filter size and the number of layers, are chosen as the assignment suggests.
The discriminators have the same architecture as in the DCGAN.
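To illustrate the residual part of the bottleneck, a minimal sketch of a `ResnetBlock` is given below; the actual block in the starter code may differ (e.g. in the number of convolutions), so this only shows the idea of the skip connection.

```python
import torch.nn as nn

class ResnetBlockSketch(nn.Module):
    """Residual block that preserves both the spatial size and channel count."""
    def __init__(self, channels):
        super().__init__()
        # A 3x3 conv with padding 1 keeps width and height unchanged,
        # so the input can simply be added back to the output.
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.conv(x)  # skip connection
```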
(##) Loss and Training Loop
In CycleGAN, in addition to the original GAN loss, a cycle-consistency loss is adopted to mitigate mode collapse of the generators.
The cycle-consistency loss is added to the GAN loss for the generators and is defined as the reconstruction loss of images from both domains.
One term of the cycle-consistency loss of $X \rightarrow Y \rightarrow X$ can be written as
$$ \mathcal{L}_\text{cycle}^{X\rightarrow Y\rightarrow X} = \frac{\lambda}{m} \sum_{i=1}^m \| x^{(i)} - G_{Y\rightarrow X}(G_{X\rightarrow Y}(x^{(i)})) \|_p, $$
where $x^{(i)}$ is a sampled image of domain $X$ and $m$ is the mini-batch size.
The constant $p$ defines the type of norm used in the reconstruction; common choices are $p=1$ ($\ell_1$ loss) and $p=2$ ($\ell_2$ loss).
Although it is expressed as a norm, the actual implementation averages over all pixels, i.e., divides the norm by the total number of elements of the rank-4 tensor.
The coefficient $\lambda$ is chosen differently depending on $p$; the course recommends $\lambda=10$ for $p=1$ and $\lambda=100$ for $p=2$.
We use these $\lambda$ values in the experiments.
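As an illustration, the $X \rightarrow Y \rightarrow X$ term could be computed as in the sketch below, where the elementwise mean realizes the pixel-averaged norm described above; the function and argument names are placeholders.

```python
import torch

def cycle_consistency_loss(real_X, G_XtoY, G_YtoX, lam=10.0, p=1):
    """Pixel-averaged reconstruction loss for the X -> Y -> X cycle (sketch)."""
    reconstructed_X = G_YtoX(G_XtoY(real_X))
    diff = real_X - reconstructed_X           # (N, C, H, W)
    if p == 1:
        return lam * diff.abs().mean()        # l1 loss: lambda = 10 recommended
    return lam * diff.pow(2).mean()           # l2 loss: lambda = 100 recommended
```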
(#) Experiments and Results
(##) DCGAN Experiments
In the DCGAN experiment, we train the model to generate images of grumpy cats (`cat/grumpifyBprocessed`).
The dataset has 204 cat images of size $64\times 64$, and the model is trained with the Adam optimizer ($\beta_1=0.5$, $\beta_2=0.999$), learning rate $3\times 10^{-4}$, and batch size 17 (so that every batch has the same number of examples).
We trained the model for 750 epochs, which corresponds to 9000 iterations.
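For reference, the optimizer setup corresponding to these hyper-parameters would look roughly as below; `G` and `D` stand for the instantiated generator and discriminator.

```python
import torch.optim as optim

# Adam with beta1 = 0.5, beta2 = 0.999 and learning rate 3e-4,
# one optimizer each for the generator and the discriminator.
g_optimizer = optim.Adam(G.parameters(), lr=3e-4, betas=(0.5, 0.999))
d_optimizer = optim.Adam(D.parameters(), lr=3e-4, betas=(0.5, 0.999))
```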
We tried the training with various configurations of data augmentation (`basic` and `deluxe`) and the discriminator loss function ($\ell_2$ loss and BCE loss).
![Figure [vanilla-basic-l2]: Basic / $\ell_2$ loss](images/vanillagan/grumpifyBprocessed_basic_l2loss-009000.png) ![Figure [vanilla-basic-bce]: Basic / BCE loss](images/vanillagan/grumpifyBprocessed_basic_bceloss-009000.png)
![Figure [vanilla-deluxe-l2]: Deluxe / $\ell_2$ loss](images/vanillagan/grumpifyBprocessed_deluxe_l2loss-009000.png) ![Figure [vanilla-deluxe-bce]: Deluxe / BCE loss](images/vanillagan/grumpifyBprocessed_deluxe_bceloss-009000.png)
Figure [vanilla-basic-l2]~Figure [vanilla-deluxe-bce] show the generator output after 750 epochs (9000 iterations) of training.
The difference is clear: the richer data augmentation (`deluxe`) and the BCE loss are superior to the simpler data augmentation (`basic`) and the $\ell_2$ loss.
In particular, the `deluxe` augmentation improves not only the quality but also the diversity of the outputs.
![Figure [vanilla-deluxe-bce-plot]: Loss plot and generation result of DCGAN over training (of Figure [vanilla-deluxe-bce])](images/vanillagan/grumpifyBprocessed_deluxe_bceloss.gif)
![Figure [vanilla-deluxe-bce-0200]: 200 steps](images/vanillagan/grumpifyBprocessed_deluxe_bceloss-000200.png) ![Figure [vanilla-deluxe-bce-1000]: 1000 steps](images/vanillagan/grumpifyBprocessed_deluxe_bceloss-001000.png) ![Figure [vanilla-deluxe-bce-9000]: 9000 steps](images/vanillagan/grumpifyBprocessed_deluxe_bceloss-009000.png)
To see how the generator outputs evolve over training (for the best configuration, `deluxe` augmentation and BCE loss, Figure [vanilla-deluxe-bce]), we plot the losses of G and D along with sampled generator outputs at several timesteps in Figure [vanilla-deluxe-bce-plot].
Figure [vanilla-deluxe-bce-0200]~Figure [vanilla-deluxe-bce-9000] also show the generation results at the very early stage of training (200 steps), at the minimum of the G loss (1000 steps), and at the end of training (9000 steps).
The results at each timestep are generated from the same noise, so we can isolate the effect of the generator on the output.
At the early stage of training (Figure [vanilla-deluxe-bce-0200]), the generator outputs only a vague shape of a cat, but the details of the outputs improve over the course of training.
One interesting observation is that the output quality improves even after G achieves its minimum loss, showing much better results at timestep 9000 than the results at timestep 1000.
We think this happens because the discriminator improves faster than the generator during training, so the generator has a hard time fooling the discriminator.
Thus an increase in the G loss does not necessarily mean a deterioration of the G output.
(##) CycleGAN Experiments
In the CycleGAN experiment, we translate the images of the two cats into each other's domain.
The datasets of the two cats (`cat/grumpifyAprocessed` and `cat/grumpifyBprocessed`) contain 75 and 204 images, respectively.
The CycleGAN model is trained with the same optimizer and hyper-parameters, except for the batch size (16) and the number of training steps (10000).
We fixed the data augmentation method to `deluxe`; instead, the experiments vary the loss functions used for the GAN loss and the cycle-consistency loss.
As in the DCGAN experiment, we tested the $\ell_2$ loss and the BCE loss for the GAN loss (and thereby the activation at the final D layer).
For the cycle-consistency loss, we tested three cases: 1) no cycle loss, 2) $\ell_1$ loss, and 3) $\ell_2$ loss.
![Figure [cycle-bceloss-nocycle-xtoy]: BCE loss / no cycle loss / $X\rightarrow Y$](images/cyclegan/deluxe_bceloss-010000-X-Y.png) ![Figure [cycle-bceloss-nocycle-ytox]: BCE loss / no cycle loss / $Y\rightarrow X$](images/cyclegan/deluxe_bceloss-010000-Y-X.png)
![Figure [cycle-l2loss-nocycle-xtoy]: $\ell_2$ loss / no cycle loss / $X\rightarrow Y$](images/cyclegan/deluxe_l2loss-010000-X-Y.png) ![Figure [cycle-l2loss-nocycle-ytox]: $\ell_2$ loss / no cycle loss / $Y\rightarrow X$](images/cyclegan/deluxe_l2loss-010000-Y-X.png)
![Figure [cycle-bceloss-cycle1-xtoy]: BCE loss / cycle $\ell_1$ loss / $X\rightarrow Y$](images/cyclegan/deluxe_bceloss_cycle1-010000-X-Y.png) ![Figure [cycle-bceloss-cycle1-ytox]: BCE loss / cycle $\ell_1$ loss $Y\rightarrow X$](images/cyclegan/deluxe_bceloss_cycle1-010000-Y-X.png)
![Figure [cycle-l2loss-cycle1-xtoy]: $\ell_2$ loss / cycle $\ell_1$ loss / $X\rightarrow Y$](images/cyclegan/deluxe_l2loss_cycle1-010000-X-Y.png) ![Figure [cycle-l2loss-cycle1-ytox]: $\ell_2$ loss / cycle $\ell_1$ loss / $Y\rightarrow X$](images/cyclegan/deluxe_l2loss_cycle1-010000-Y-X.png)
![Figure [cycle-bceloss-cycle2-xtoy]: BCE loss / cycle $\ell_2$ loss / $X\rightarrow Y$](images/cyclegan/deluxe_bceloss_cycle2-010000-X-Y.png) ![Figure [cycle-bceloss-cycle2-ytox]: BCE loss / cycle $\ell_2$ loss / $Y\rightarrow X$](images/cyclegan/deluxe_bceloss_cycle2-010000-Y-X.png)
![Figure [cycle-l2loss-cycle2-xtoy]: $\ell_2$ loss / cycle $\ell_2$ loss / $X\rightarrow Y$](images/cyclegan/deluxe_l2loss_cycle2-010000-X-Y.png) ![Figure [cycle-l2loss-cycle2-ytox]: $\ell_2$ loss / cycle $\ell_2$ loss / $Y\rightarrow X$](images/cyclegan/deluxe_l2loss_cycle2-010000-Y-X.png)
Figure [cycle-bceloss-nocycle-xtoy]~Figure [cycle-l2loss-cycle2-ytox] show the generator outputs on the sample images, in both directions, for the different configurations of the GAN loss and the cycle-consistency loss (captions indicate the type of GAN loss / cycle loss / G direction).
In general, converting from $Y$ to $X$ is more difficult than the opposite direction.
Unlike in the DCGAN experiment, the outputs are much more stable with the $\ell_2$ loss than with the BCE loss for the GAN loss.
Among the cycle-consistency loss configurations, the $\ell_1$ and $\ell_2$ losses give similar quality, while omitting the cycle-consistency loss deteriorates the results, especially in the more difficult direction $Y \rightarrow X$.
![Figure [cycle-l2loss-cycle2-plot]: Loss plot and generation result of CycleGAN over training (of Figure [cycle-l2loss-cycle2-xtoy] and Figure [cycle-l2loss-cycle2-ytox])](images/cyclegan/deluxe_l2loss_cycle2.gif)
We choose one of the best configurations (the $\ell_2$ GAN loss and the $\ell_2$ cycle loss, Figure [cycle-l2loss-cycle2-xtoy] and Figure [cycle-l2loss-cycle2-ytox]) and plot the losses and the generation results over training in Figure [cycle-l2loss-cycle2-plot].
The overall quality of the generation and the losses stay roughly the same after a few thousand steps.
Even with the best configuration, the quality of the generated outputs is far inferior to the original dataset; we think this could be improved with a larger dataset, more data augmentation, and additional GAN techniques.
**Bibliography**:
[#GoodFellow14]: Goodfellow, Ian, et al. 2014. Generative Adversarial Nets. In _Advances in Neural Information Processing Systems (NIPS '14)_, https://arxiv.org/pdf/1406.2661.pdf
[#Radford16]: Radford, Alec, et al. 2016. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In _International Conference on Learning Representations (ICLR '16)_, https://arxiv.org/pdf/1511.06434v2.pdf
[#Zhu17]: Zhu, Jun-Yan, et al. 2017. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In _Proceedings of the IEEE International Conference on Computer Vision (ICCV '17)_, https://arxiv.org/pdf/1703.10593.pdf