**CMU 16-726 Learning-based Image Synthesis** **Assignment #3** *Title: "When Cats meet GANs"* *Name: Soyong Shin (soyongs@andrew.cmu.edu)* (##) Contents * Part 1 DC-GAN * Part 2 Cycle-GAN * Part 3 Bells & Whistles (##) Part 1 DC-GAN For the first part, I have implemented a Deep Convolutional Generative Adversarial Network (DC-GAN). The model architecture follows the figures given in the assignment instructions, shown below: ![figure [model_architecture]: Discriminator Architecture](report/Figure1.png) ![figure [model_architecture]: Generator Architecture](report/Figure2.png)
In this section, I describe each part of the DC-GAN and its training algorithm, and discuss the results.

------------------------------------------------------------------------------------------------------------------------------------------------------------ **1.1 Discriminator** The DC-GAN discriminator $\mathcal{D}$ takes a batch of images as input and outputs, for each image, the probability (0 to 1) that it is real. In order to obtain the intermediate feature-map sizes shown in Figure 1, the kernel size, stride, and padding must be chosen consistently. Since the kernel size $K=4$ and stride $S=2$ are given by the assignment, I obtained the padding $P$ from the convolution output-size formula below. $$ W_{out} = \frac{W_{in} + 2 \cdot P - (K - 1) - 1}{S} + 1 $$ Note that $W_{in}$ and $W_{out}$ are the input and output widths of the feature map, and the dilation was assumed to be 1. I only consider the width of the feature map since the feature maps are assumed to be square (i.e., width = height). Since each convolutional layer reduces the feature-map size, $W_{in}$ can be substituted as: $$ W_{in} = r \cdot W_{out} $$ where $r$ is the ratio between $W_{in}$ and $W_{out}$. Therefore, the padding size $P$ is: $$ P = \frac{S \cdot W_{out} - r \cdot W_{out} - S + K}{2} $$ This is implemented as the function ***get_padding*** in ***models.py*** (a minimal sketch is shown right after the discriminator code below). The source code of the discriminator class is as follows:
```python
class DCDiscriminator(nn.Module):
    def __init__(self, norm='batch'):
        super(DCDiscriminator, self).__init__()

        K = 4   # kernel_size
        S = 2   # stride

        self.conv1 = conv(3, 32, K, S, get_padding(32, K, S), norm=norm)
        self.conv2 = conv(32, 64, K, S, get_padding(16, K, S), norm=norm)
        self.conv3 = conv(64, 128, K, S, get_padding(8, K, S), norm=norm)
        self.conv4 = conv(128, 256, K, S, get_padding(4, K, S), norm=norm)
        self.conv5 = conv(256, 1, K, S, get_padding(1, K, S, factor=4), norm='none')

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = F.relu(self.conv2(out))
        out = F.relu(self.conv3(out))
        out = F.relu(self.conv4(out))
        out = self.conv5(out).squeeze()

        return out
```
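For reference, below is a minimal sketch of how ***get_padding*** can be written from the padding formula above. The argument names (`out_size`, and `factor` for the ratio $r$, defaulting to 2 since each layer halves the feature map) are my own labels and may differ slightly from the actual signature in ***models.py***.

```python
def get_padding(out_size, kernel_size, stride, factor=2):
    """Return the padding P that maps an input of width factor * out_size
    down to out_size, for the given kernel size and stride (dilation = 1)."""
    padding = (stride * (out_size - 1) - factor * out_size + kernel_size) / 2
    return int(padding)

# Example: get_padding(32, 4, 2) == 1, i.e. a 64x64 map becomes 32x32 with K=4, S=2, P=1.
```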


------------------------------------------------------------------------------------------------------------------------------------------------------------ **1.2 Generator** The DC-GAN generator $\mathcal{G}$, on the other hand, generates images from an n-dimensional latent vector $z \in \mathbb{R}^{100}$. For this, deconvolutional layers (PyTorch *ConvTranspose2d*) were used. The padding size $P$ was set to 0 for the first layer and 1 for the rest to match the feature-map sizes. The source code of the generator class is as follows:
```python
class DCGenerator(nn.Module):
    def __init__(self, noise_size, norm):
        super(DCGenerator, self).__init__()

        K = 4
        S = 2

        self.deconv1 = deconv(100, 256, K, 1, padding=0, norm=norm)
        self.deconv2 = deconv(256, 128, K, S, padding=1, norm=norm)
        self.deconv3 = deconv(128, 64, K, S, padding=1, norm=norm)
        self.deconv4 = deconv(64, 32, K, S, padding=1, norm=norm)
        self.deconv5 = deconv(32, 3, K, S, padding=1, norm='none')

    def forward(self, z):
        out = F.relu(self.deconv1(z))
        out = F.relu(self.deconv2(out))
        out = F.relu(self.deconv3(out))
        out = F.relu(self.deconv4(out))
        out = F.tanh(self.deconv5(out))

        return out
```
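As a sanity check on the layer settings above, the deconvolution stack maps a $100 \times 1 \times 1$ noise tensor up to a $3 \times 64 \times 64$ image. The snippet below reproduces only the feature-map sizes with plain *ConvTranspose2d* layers; it is a sketch under the assumption that ***deconv*** wraps *ConvTranspose2d* (normalization and activations omitted).

```python
import torch
import torch.nn as nn

# Mirrors the deconv settings of DCGenerator to verify the feature-map sizes.
layers = nn.Sequential(
    nn.ConvTranspose2d(100, 256, kernel_size=4, stride=1, padding=0),  # 1x1   -> 4x4
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),  # 4x4   -> 8x8
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),   # 8x8   -> 16x16
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),    # 16x16 -> 32x32
    nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),     # 32x32 -> 64x64
)
z = torch.randn(16, 100, 1, 1)  # a batch of 16 noise vectors
print(layers(z).shape)          # torch.Size([16, 3, 64, 64])
```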


------------------------------------------------------------------------------------------------------------------------------------------------------------ **1.3 Training** The training algorithm is implemented in ***vanilla_gan.py***. The training script accepts a few command-line arguments with default values.
*1.3.1. Discriminator Loss* Since the discriminator $\mathcal{D}$ tries to distinguish fake images from real images, its loss function $\mathcal{L}_D$ is set as: $$ \mathcal{L}_{D, real} = \frac{1}{2m} \sum_{b}^{m}{(1 - \mathcal{D}(I_{b, real}))^2} $$ $$ \mathcal{L}_{D, fake} = \frac{1}{2m} \sum_{b}^{m}{\mathcal{D}(I_{b, fake})^2} \hspace{7mm} (I_{b, fake} = \mathcal{G}(z_{b})) $$ $$ \mathcal{L}_D = \mathcal{L}_{D, real} + \mathcal{L}_{D, fake} $$
*1.3.2. Generator Loss* For the generator $\mathcal{G}$, the objective is to deceive the discriminator with fake images; thus, its loss is defined as the squared difference between the discriminator's scores on fake images and the real label. $$ \mathcal{L}_{G} = \frac{1}{m} \sum_{b}^{m}{(1 - \mathcal{D}(I_{b, fake}))^2} = \frac{1}{m} \sum_{b}^{m}{(1 - \mathcal{D}(\mathcal{G}(z_b)))^2} $$ The algorithm for training both $\mathcal{D}$ and $\mathcal{G}$ follows the procedure given in the assignment. ![figure [dcgan_training]: Training Algorithm for DC-GAN](report/Figure3.png)
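To make the least-squares losses above concrete, here is a hedged sketch of one training iteration. The names (`D`, `G`, `d_optimizer`, `g_optimizer`, `real_images`) are illustrative and not necessarily the exact ones used in ***vanilla_gan.py***.

```python
import torch

def gan_training_step(D, G, real_images, d_optimizer, g_optimizer, noise_size=100):
    m = real_images.size(0)

    # --- Discriminator update: push D(real) toward 1 and D(fake) toward 0 ---
    d_optimizer.zero_grad()
    z = torch.randn(m, noise_size, 1, 1, device=real_images.device)
    fake_images = G(z).detach()  # do not backprop into G here
    d_loss = torch.mean((D(real_images) - 1) ** 2) / 2 + torch.mean(D(fake_images) ** 2) / 2
    d_loss.backward()
    d_optimizer.step()

    # --- Generator update: push D(G(z)) toward 1 ---
    g_optimizer.zero_grad()
    z = torch.randn(m, noise_size, 1, 1, device=real_images.device)
    g_loss = torch.mean((D(G(z)) - 1) ** 2)
    g_loss.backward()
    g_optimizer.step()

    return d_loss.item(), g_loss.item()
```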
------------------------------------------------------------------------------------------------------------------------------------------------------------ **1.4 Experiments Results** I have conducted four experiments (data augmentation: basic/deluxe × cat species: A/B) with the aforementioned DC-GAN, and here I report the results.
*1.4.1. Loss graphs* ![figure [dcgan_training_loss]: Discriminator loss 1](report/Figure4_2.png) ![figure [dcgan_training_loss]: Discriminator loss 2](report/Figure4_1.png) ![figure [dcgan_training_loss]: Generator loss](report/Figure4_3.png) These loss graphs are from training DC-GAN on the *grumpifyAprocessed* data with basic augmentation. The training lasted 500 epochs.
*1.4.2. Qualitative results* ![figure [dcgan_training_results]: 200 iter.](report/Figure5_1.png) ![figure [dcgan_training_results]: 800 iter.](report/Figure5_2.png) ![figure [dcgan_training_results]: 1400 iter.](report/Figure5_3.png) ![figure [dcgan_training_results]: 5000 iter.](report/Figure5_5.png) Figures 7-10 show how the quality of the generated images changes with the number of training iterations. This training was on the *grumpifyAprocessed* data with deluxe augmentation. The network initially generates noisy images and, once well trained, produces high-quality results.
*1.4.3. Qualitative comparison with data augmentation* ![figure [dcgan_training_results]: Basic augmentation](report/Figure6_1.png) ![figure [dcgan_training_results]: Deluxe augmentation](report/Figure6_2.png) Figures 11 and 12 are the generator results on *grumpifyBprocessed* after 13k training iterations with **basic** and **deluxe** data augmentation, respectively. The deluxe augmentation has 5 transform layers including *ToTensor()*. The deluxe transform is implemented as below.
```python
from PIL import Image
import torchvision.transforms as transforms

load_size = int(1.1 * opts.image_size)
osize = [load_size, load_size]
transform_layers = [
    transforms.Resize(osize, Image.BICUBIC),
    transforms.RandomCrop(opts.image_size),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
]
transform = transforms.Compose(transform_layers)
```
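For comparison, the **basic** configuration essentially only resizes and normalizes the images. A sketch of what that pipeline looks like is below (this mirrors the usual starter-code setup and may differ in detail from the exact ***data_loader.py***).

```python
# Basic augmentation sketch: no random crop or flip, only resize + normalize.
basic_transform = transforms.Compose([
    transforms.Resize(opts.image_size, Image.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
```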
The figures show that the **deluxe** augmentation trains the network better. (##) Part 2 Cycle-GAN Next, I implemented a Cycle-GAN network that transfers the style between two types of input data. Here, the network architecture of the discriminator is identical to that of DC-GAN; the generator, however, is different. The generator no longer takes normally distributed noise, but instead takes an image as input and outputs a transformed image. The network architecture of the generator is shown in Figure 13. ![figure [model_architecture]: Cycle-GAN Generator Architecture](report/Figure7.png)
In this section, I give a detailed description of the generator and the training algorithm.

------------------------------------------------------------------------------------------------------------------------------------------------------------ **2.1 Generator** The generator $\mathcal{G}$ takes an image of either style X or style Y (where X and Y are the two types of images) as input. Then, using an encoder-decoder architecture, it outputs an image of the other style (input X: output Y // input Y: output X). In other words, there are two generators, $\mathcal{G}_{X \rightarrow Y}$ and $\mathcal{G}_{Y \rightarrow X}$. $\mathcal{G}_{X \rightarrow Y}$ takes $I_X$ as the input image and tries to output $I_Y$, transferring the style from X to Y. For the encoder part, the ***get_padding*** function was used again to match the feature-map sizes given in Figure 13. Then, I used 3 residual blocks and 2 deconvolution layers to decode the transformed image.
```python
class CycleGenerator(nn.Module):
    def __init__(self, norm='batch'):
        super(CycleGenerator, self).__init__()

        K = 4
        S = 2

        # 1. Define the encoder part of the generator (that extracts features from the input image)
        self.conv1 = conv(3, 32, K, S, get_padding(32, K, S), norm=norm)
        self.conv2 = conv(32, 64, K, S, get_padding(16, K, S), norm=norm)

        # 2. Define the transformation part of the generator
        self.resnet_block1 = ResnetBlock(64, norm=norm)
        self.resnet_block2 = ResnetBlock(64, norm=norm)
        self.resnet_block3 = ResnetBlock(64, norm=norm)

        # 3. Define the decoder part of the generator (that builds up the output image from features)
        self.deconv1 = deconv(64, 32, K, S, padding=1, norm=norm)
        self.deconv2 = deconv(32, 3, K, S, padding=1, norm='none')

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = F.relu(self.conv2(out))

        out = F.relu(self.resnet_block1(out))
        out = F.relu(self.resnet_block2(out))
        out = F.relu(self.resnet_block3(out))

        out = F.relu(self.deconv1(out))
        out = F.tanh(self.deconv2(out))

        return out
```
The code for the generator is implemented as above.
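The ***ResnetBlock*** used in the transformation part is not shown above. Below is a minimal sketch of such a block: a size-preserving $3 \times 3$ convolution plus a skip connection. The exact block in ***models.py*** may differ (e.g., in how normalization is applied), so treat this as an assumption-laden sketch.

```python
import torch.nn as nn

class ResnetBlock(nn.Module):
    """Residual block: a 3x3, stride-1, padding-1 convolution (which preserves
    the spatial size) followed by optional normalization, added to the input."""
    def __init__(self, conv_dim, norm='batch'):
        super(ResnetBlock, self).__init__()
        layers = [nn.Conv2d(conv_dim, conv_dim, kernel_size=3, stride=1, padding=1)]
        if norm == 'batch':
            layers.append(nn.BatchNorm2d(conv_dim))
        elif norm == 'instance':
            layers.append(nn.InstanceNorm2d(conv_dim))
        self.conv_layer = nn.Sequential(*layers)

    def forward(self, x):
        # Learn a residual on top of the input feature map.
        return x + self.conv_layer(x)
```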

------------------------------------------------------------------------------------------------------------------------------------------------------------ **2.2 Training** The training script for Cycle-GAN is implemented in ***cycle_gan.py***. Unlike DC-GAN, which mainly aims to train the generator to produce realistic images, Cycle-GAN trains its generators to swap the styles of the two image sets. Therefore, the training algorithm of Cycle-GAN differs from that of DC-GAN.
*2.2.1. Discriminator Loss* There are two discriminators, $\mathcal{D}_X$ and $\mathcal{D}_Y$, each of which discriminates images of the corresponding style. The loss function of discriminator $\mathcal{D}_X$ is given as: $$ \mathcal{L}_{D_X, real} = \frac{1}{2m} \sum_{b}^{m}{(1 - \mathcal{D}_X(I_{X, b, real}))^2} $$ $$ \mathcal{L}_{D_X, fake} = \frac{1}{2m} \sum_{b}^{m}{\mathcal{D}_X(I_{X, b, fake})^2} \hspace{7mm} (I_{X, b, fake} = \mathcal{G}_{Y \rightarrow X}(I_{Y, b, real})) $$ $$ \mathcal{L}_{D_X} = \mathcal{L}_{D_X, real} + \mathcal{L}_{D_X, fake} $$ Likewise, the loss function for $\mathcal{D}_Y$ is given as: $$ \mathcal{L}_{D_Y} = \frac{1}{2m} \sum_{b}^{m}{(1 - \mathcal{D}_Y(I_{Y, b, real}))^2} + \frac{1}{2m} \sum_{b}^{m}{\mathcal{D}_Y(\mathcal{G}_{X \rightarrow Y}(I_{X, b, real}))^2} $$
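A hedged sketch of how these discriminator losses can be computed in one training step of ***cycle_gan.py*** is given below; names such as `D_X`, `D_Y`, `G_XtoY`, `G_YtoX`, and `d_optimizer` are illustrative rather than the exact ones in the script.

```python
import torch

def cycle_discriminator_step(D_X, D_Y, G_XtoY, G_YtoX, real_X, real_Y, d_optimizer):
    d_optimizer.zero_grad()
    # Generate fakes and detach so gradients do not flow into the generators here.
    fake_X = G_YtoX(real_Y).detach()
    fake_Y = G_XtoY(real_X).detach()
    d_loss = (torch.mean((D_X(real_X) - 1) ** 2) + torch.mean(D_X(fake_X) ** 2)
              + torch.mean((D_Y(real_Y) - 1) ** 2) + torch.mean(D_Y(fake_Y) ** 2)) / 2
    d_loss.backward()
    d_optimizer.step()
    return d_loss.item()
```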
*2.2.2. Generator Loss* Inversely, we train the generators $\mathcal{G}_{X \rightarrow Y}$ and $\mathcal{G}_{Y \rightarrow X}$ to deceive the corresponding discriminators $\mathcal{D}_Y$ and $\mathcal{D}_X$. Therefore, the loss function for each generator is given as: $$ \mathcal{L}_{G_{X \rightarrow Y}} = \frac{1}{m} \sum_{b}^{m}{(1 - \mathcal{D}_{Y}(I_{Y, b, fake}))^2} = \frac{1}{m} \sum_{b}^{m}{(1 - \mathcal{D}_{Y}(\mathcal{G}_{X \rightarrow Y}(I_{X, b, real})))^2} $$ $$ \mathcal{L}_{G_{Y \rightarrow X}} = \frac{1}{m} \sum_{b}^{m}{(1 - \mathcal{D}_{X}(I_{X, b, fake}))^2} = \frac{1}{m} \sum_{b}^{m}{(1 - \mathcal{D}_{X}(\mathcal{G}_{Y \rightarrow X}(I_{Y, b, real})))^2} $$
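Correspondingly, a sketch of the generator-side update is given below. It already includes the $\lambda$-weighted cycle-consistency term that is introduced in Section 2.2.3 next; again, the names are illustrative rather than the exact ones in ***cycle_gan.py***.

```python
import torch

def cycle_generator_step(D_X, D_Y, G_XtoY, G_YtoX, real_X, real_Y, g_optimizer, lam=10.0):
    g_optimizer.zero_grad()
    fake_Y = G_XtoY(real_X)
    fake_X = G_YtoX(real_Y)
    # Adversarial terms: try to make the corresponding discriminator output 1.
    g_loss = torch.mean((D_Y(fake_Y) - 1) ** 2) + torch.mean((D_X(fake_X) - 1) ** 2)
    # Cycle-consistency terms (Section 2.2.3): X -> Y -> X and Y -> X -> Y, L1 distance.
    g_loss = g_loss + lam * torch.mean(torch.abs(real_X - G_YtoX(fake_Y)))
    g_loss = g_loss + lam * torch.mean(torch.abs(real_Y - G_XtoY(fake_X)))
    g_loss.backward()
    g_optimizer.step()
    return g_loss.item()
```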
*2.2.3. Cycle Consistency Loss* Assume $I_{X, \mathcal{C}}$ has style X and content $\mathcal{C}$. The objective of Cycle-GAN is to transfer the style, not the content; therefore the generator $\mathcal{G}_{X \rightarrow Y}$ aims to generate $I_{Y, \mathcal{C}}$, which has content $\mathcal{C}$ and style Y, from $I_{X, \mathcal{C}}$. On the other hand, the other generator $\mathcal{G}_{Y \rightarrow X}$ tries to reconstruct $I_{X, \mathcal{C}}$, which has content $\mathcal{C}$ and style X, when given $I_{Y, \mathcal{C}}$ as input. From this intuition, we can express this as: $$ I_{X, \mathcal{C}} = \mathcal{G}_{Y \rightarrow X}(I_{Y, \mathcal{C}}) = \mathcal{G}_{Y \rightarrow X}(\mathcal{G}_{X \rightarrow Y}(I_{X, \mathcal{C}})) = \mathcal{G}_{X \rightarrow Y \rightarrow X}(I_{X, \mathcal{C}}) $$ Here $\mathcal{G}_{X \rightarrow Y \rightarrow X}(\cdot) = \mathcal{G}_{Y \rightarrow X} (\mathcal{G}_{X \rightarrow Y}(\cdot))$. From this we can build two cycle consistency losses, $\mathcal{L}_{cycle}^{X \rightarrow Y \rightarrow X}$ and $\mathcal{L}_{cycle}^{Y \rightarrow X \rightarrow Y}$. The two loss functions are given as: $$ \mathcal{L}_{cycle}^{X \rightarrow Y \rightarrow X} = \lambda \cdot \frac{1}{m} \sum_{b}^{m}{|| I_{X, b, real} - \mathcal{G}_{Y \rightarrow X}(\mathcal{G}_{X \rightarrow Y}(I_{X, b, real}))||_1} $$ $$ \mathcal{L}_{cycle}^{Y \rightarrow X \rightarrow Y} = \lambda \cdot \frac{1}{m} \sum_{b}^{m}{|| I_{Y, b, real} - \mathcal{G}_{X \rightarrow Y}(\mathcal{G}_{Y \rightarrow X}(I_{Y, b, real}))||_1} $$ As shown in the equations above, the loss functions use the L1 norm, and the weight $\lambda$ was set to 10. The algorithm for training the four networks $\mathcal{D}_{X}$, $\mathcal{D}_{Y}$, $\mathcal{G}_{X \rightarrow Y}$, and $\mathcal{G}_{Y \rightarrow X}$ follows the procedure given in the assignment. ![figure [cyclegan_training]: Training Algorithm for Cycle-GAN](report/Figure8.png) ------------------------------------------------------------------------------------------------------------------------------------------------------------ **2.3 Experiments Results** For the Cycle-GAN task, I conducted two different experiments, with and without the **cycle consistency loss** term. Here I report the results. Note that the discriminator architecture was identical to that of DC-GAN.
*2.3.1. Loss graphs* ![figure [cyclegan_training_loss]: Discriminator losses](report/Figure9_1.png) ![figure [cyclegan_training_loss]: Generator losses](report/Figure9_2.png) These graphs are the results of training on the cat dataset, transferring styles between *grumpifyAprocessed* and *grumpifyBprocessed*, with the cycle consistency loss enabled. Overall, the losses converge well; however, the generator $\mathcal{G}_{Y \rightarrow X}$ (here X is *grumpifyAprocessed* and Y is *grumpifyBprocessed*) has some difficulty in training.
*2.3.2. Qualitative results* ![figure [cyclegan_results]: (A to B) 250 iter.](report/Figure10_1.png) ![figure [cyclegan_results]: (A to B) 750 iter.](report/Figure10_2.png) ![figure [cyclegan_results]: (A to B) 10000 iter.](report/Figure10_3.png)
![figure [cyclegan_results]: (B to A) 250 iter.](report/Figure11_1.png) ![figure [cyclegan_results]: (B to A) 750 iter.](report/Figure11_2.png) ![figure [cyclegan_results]: (B to A) 10000 iter.](report/Figure11_3.png) Figures 17-22 show the qualitative results of Cycle-GAN over the training iterations. The sample results improve as training progresses, and at the end of training (i.e., 10,000 iterations) the outputs look visually acceptable in general. However, as mentioned in 2.3.1, $\mathcal{G}_{Y \rightarrow X}$, which transfers B to A, shows relatively poor quality. All of these figures use the cycle consistency loss and deluxe data augmentation.
*2.3.3. Qualitative comparison with cycle consistency loss* ![figure [cyclegan_results]: (A to B) **without** cycle loss](report/Figure12_1.png) ![figure [cyclegan_results]: (A to B) **with** cycle loss](report/Figure12_2.png)
![figure [cyclegan_results]: (B to A) **without** cycle loss](report/Figure13_1.png) ![figure [cyclegan_results]: (B to A) **with** cycle loss](report/Figure13_2.png) Figures 23-26 compare the qualitative results **with** and **without** the cycle consistency loss. For the *A to B* transformation, both results are comparable regardless of whether the loss is used. However, for the *B to A* transformation, the cycle consistency loss significantly improves the output of the generator. (##) Part 3 Bells & Whistles Here I report the analyses done for extra credit. ------------------------------------------------------------------------------------------------------------------------------------------------------------ **3.1 Patch discriminator** The first bells & whistles item that I tried is the patch discriminator.
*3.1.1. Introduction* The discriminator described above outputs the probability that an image is real based on all image pixels. In contrast, a patch discriminator outputs such probabilities for smaller sub-regions of the image (i.e., image patches). The intuition is that image pixels far apart from each other are roughly independent of each other. Furthermore, patch-wise discrimination makes the generator and discriminator more sensitive to the detailed texture of the image.
*3.1.2. Implementation* My implementation of the patch discriminator is as below.
```python
class PatchDiscriminator(nn.Module):
    """Bells & Whistle 1"""
    def __init__(self, norm='instance'):
        super(PatchDiscriminator, self).__init__()

        K = 4
        S = 2

        self.conv1 = conv(3, 32, K, S, get_padding(32, K, S), norm=norm)
        self.conv2 = conv(32, 64, K, S, get_padding(16, K, S), norm=norm)
        self.conv3 = conv(64, 128, K, S, get_padding(8, K, S), norm=norm)
        self.conv4 = conv(128, 1, K, S, get_padding(4, K, S), norm='none')

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = F.relu(self.conv2(out))
        out = F.relu(self.conv3(out))
        out = self.conv4(out)

        return out
```
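As a shape check, with $64 \times 64$ inputs the layers above produce a $4 \times 4$ grid of scores, one per overlapping patch, rather than a single scalar; the least-squares GAN loss is then simply averaged over all patch scores. The snippet below mirrors only the feature-map sizes with plain *Conv2d* layers (assuming ***conv*** wraps *Conv2d*; normalization and activations omitted).

```python
import torch
import torch.nn as nn

# Mirrors the PatchDiscriminator layer settings to verify the output shape.
layers = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1),    # 64 -> 32
    nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),   # 32 -> 16
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),  # 16 -> 8
    nn.Conv2d(128, 1, kernel_size=4, stride=2, padding=1),   # 8 -> 4
)
x = torch.randn(8, 3, 64, 64)
print(layers(x).shape)  # torch.Size([8, 1, 4, 4])
# e.g. real-image loss: torch.mean((layers(x) - 1) ** 2) / 2
```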

*3.1.3. Qualitative results* ![figure [patch_discriminator_results]: (A to B) DC discriminator](report/Figure14_1.png) ![figure [patch_discriminator_results]: (A to B) Patch discriminator](report/Figure14_2.png)
![figure [patch_discriminator_results]: (B to A) DC discriminator](report/Figure15_1.png) ![figure [patch_discriminator_results]: (B to A) Patch discriminator](report/Figure15_2.png) Figures 27-30 compare the qualitative output of Cycle-GAN when using the DC discriminator and the patch discriminator. For the *A to B* transfer, both results seem comparable to each other. However, for the *B to A* transformation, using the patch discriminator yields an enormous improvement in output quality. ------------------------------------------------------------------------------------------------------------------------------------------------------------ **3.2 Animated videos** Here I show animated gif files that dynamically illustrate how my networks are trained. The gif files were generated using the *ffmpeg* library.
*3.2.1. Training videos* ![**Video 1**: *grumpifyA* DC-GAN](report/gif1.gif) ![**Video 2**: *grumpifyB* DC-GAN](report/gif2.gif) ![**Video 3**: (A to B) Cycle-GAN](report/gif3.gif) ![**Video 4**: (B to A) Cycle-GAN](report/gif4.gif)