**When Cats meets GANs**

Student name: Abhishek Pavani

(#) Introduction

In this assignment, I implemented two GAN architectures along with a denoising diffusion model. In the first part, I implemented a type of GAN designed for image data, called a Deep Convolutional GAN (DCGAN), and used it to generate grumpy cats from samples of random noise. In the second part, I implemented a more complex GAN architecture called CycleGAN for the task of image-to-image translation. In particular, I trained CycleGAN to translate between two kinds of cats (Grumpy and Russian Blue), and between apples and oranges.

(#) Deep Convolutional GAN (DCGAN)

We start with a Deep Convolutional GAN (DCGAN) for generating images of cats. The generator in this setup uses a sequence of upsampling operations, each followed by a convolutional layer (instead of a transposed convolution layer). It accepts a latent vector z, sampled from a Gaussian distribution, and produces an image of a cat. The discriminator consists of a series of convolutional layers that take in a cat image and output a single value indicating whether the image is real or fake, i.e., generated by the generator. The network architectures can be seen below.

(##) Generator

(##) Discriminator

(##) Implementation details

1. **Padding**: In each of the convolutional layers shown above, we downsample the spatial dimension of the input volume by a factor of 2. Given kernel size K = 4 and stride S = 2, the output spatial size is

W_out = (W_in + 2P - K)/S + 1

Substituting K = 4 and S = 2 and solving for the padding P gives

P = W_out - W_in/2 + 1

Using this formula for each layer, **we find P = 1 for all layers except the last, where P = 0**: each intermediate layer halves the spatial size, so W_out = W_in/2 and P = 1, while the last layer maps a 4x4 feature map to a 1x1 output, so P = 1 - 4/2 + 1 = 0.

To train the networks, we need to compute their respective losses. For the discriminator, the loss is the average of the losses on real and fake cat images: the real loss grows when the discriminator classifies a real image as fake, and the fake loss grows when it classifies a generated image as real. The generator's loss measures how effectively it deceives the discriminator. Mean-squared-error loss is used for all the losses; a code sketch of both losses is given below.

**Generator loss**

**Discriminator loss**

To prevent the discriminator from overfitting to the dataset and to improve the generator's performance, we augmented our data. This was done by scaling the images to 1.1 times the specified image size, randomly cropping them back to the specified size, and then randomly flipping them horizontally (see the transform sketch below). Additionally, we experimented with Differentiable Augmentation (DiffAugment), which applies differentiable augmentations such as random brightness, saturation, contrast, translation, and cutout. Our GAN was trained on the Grumpy Cats B dataset for 500 epochs, using a batch size of 16 and instance normalization. We conducted several experiments: training without any augmentation, with regular augmentation only, with DiffAugment only, and with both regular augmentation and DiffAugment. The results are presented below.
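As a concrete reference for the loss equations above, here is a minimal PyTorch sketch of the least-squares generator and discriminator losses. The discriminator `D` is assumed to output raw (unbounded) scores, and the function names are illustrative rather than the exact ones used in the assignment code.

```python
import torch

def discriminator_loss(D, real_images, fake_images):
    # Real images should be scored as 1, generated images as 0.
    real_loss = torch.mean((D(real_images) - 1) ** 2)
    fake_loss = torch.mean(D(fake_images) ** 2)
    # Average the real and fake terms, as described above.
    return 0.5 * (real_loss + fake_loss)

def generator_loss(D, fake_images):
    # The generator wins when the discriminator scores its fakes as real (1).
    return torch.mean((D(fake_images) - 1) ** 2)
```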
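The deluxe augmentation pipeline described above can be written as a `torchvision` transform. This is a minimal sketch; the 64x64 training resolution is an assumption.

```python
import torchvision.transforms as T

image_size = 64  # assumed training resolution

deluxe_transform = T.Compose([
    T.Resize(int(1.1 * image_size)),   # scale to 1.1x the target size
    T.RandomCrop(image_size),          # randomly crop back to the target size
    T.RandomHorizontalFlip(),          # random horizontal flip
    T.ToTensor(),
    T.Normalize(mean=[0.5] * 3, std=[0.5] * 3),  # map pixel values to [-1, 1]
])
```

DiffAugment, by contrast, is applied to both real and generated images inside the training loop, so that gradients can flow through the augmentations to the generator.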
Without any augmentation, the generator produced poor-quality images, with all the output cats looking identical. With regular augmentation only, the cats looked more diverse, but the images were still somewhat blurry. With DiffAugment only, the cats appeared smoother, but they all seemed to be looking in the same direction. The most diverse outputs were obtained when using both regular augmentation and DiffAugment.

(##) Loss Curves

**Basic Augmentation**

| Generator Loss | Discriminator Loss|
|----------------|-------------------|
| | |

**Deluxe Augmentation**

| Generator Loss | Discriminator Loss|
|----------------|-------------------|
| | |

**Basic Augmentation with DiffAugment**

| Generator Loss | Discriminator Loss|
|----------------|-------------------|
| | |

**Deluxe Augmentation with DiffAugment**

| Generator Loss | Discriminator Loss|
|----------------|-------------------|
| | |

As the graphs show, the discriminator loss decreases over iterations while the generator loss increases. In a GAN, the generator tries to produce samples that resemble the real data, while the discriminator tries to tell real samples from generated ones; the two are trained simultaneously, with the generator trying to fool the discriminator and the discriminator trying to catch it. A rising generator loss together with a falling discriminator loss therefore indicates that the discriminator is gaining the upper hand: it separates real from fake more reliably, while the generator has a harder time fooling it. This can happen because the discriminator has more capacity, the training data is diverse, or the generated samples are not yet realistic enough. It is important to monitor both losses during training; if the generator loss keeps rising while the discriminator loss keeps falling, the training process may need to be rebalanced to keep the two networks competitive.

**Basic Augmentation Results**

| Real | Generated|
|----------------|-------------------|
| | |

**Deluxe Augmentation Results**

| Iteration 200 | Iteration 6400|
|----------------|-------------------|
| | |

**Deluxe Augmentation with DiffAugment Results**

| Iteration 200 | Iteration 6400|
|----------------|-------------------|
| | |

Comparing the two samples, there is a clear improvement from iteration 200 (left) to iteration 6400 (right). In the left image, the cat has a shadowy outline but lacks detail, has incorrect colors, and shows noticeable artifacts such as colored lines. In contrast, the image on the right looks realistic with only minor artifacts, and is much more natural and lifelike.
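For completeness, here is a minimal sketch of one DCGAN training iteration using the least-squares losses above. `G`, `D`, the optimizers, and the latent dimension `z_dim` are placeholders assumed to be defined elsewhere.

```python
import torch

def train_step(G, D, g_opt, d_opt, real_images, z_dim=100):
    """One adversarial iteration: update D on real/fake images, then update G."""
    noise = torch.randn(real_images.size(0), z_dim, 1, 1,
                        device=real_images.device)
    fake_images = G(noise)

    # Discriminator step: real -> 1, fake -> 0 (detach fakes so G is untouched).
    d_opt.zero_grad()
    d_loss = 0.5 * (torch.mean((D(real_images) - 1) ** 2)
                    + torch.mean(D(fake_images.detach()) ** 2))
    d_loss.backward()
    d_opt.step()

    # Generator step: push D's score on fakes toward "real" (1).
    g_opt.zero_grad()
    g_loss = torch.mean((D(fake_images) - 1) ** 2)
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```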
(#) CycleGAN

CycleGAN is a type of GAN that allows for image translation between two domains without the need for paired data. Similar to DCGAN, CycleGAN uses a generator to create images, but instead of starting from noise, it takes an input image from one domain and outputs an image in the other domain.

(##) Generator

(##) Discriminator

The model architecture comprises two generator models: one generator (Generator-A) generates images for the first domain (Domain-A), and the second (Generator-B) generates images for the second domain (Domain-B):

Domain-B -> Generator-A -> Domain-A

Domain-A -> Generator-B -> Domain-B

Each generator has a corresponding discriminator model (Discriminator-A and Discriminator-B). Each discriminator takes real images from its domain and generated images from the corresponding generator and predicts whether they are real or fake:

Domain-A -> Discriminator-A -> [Real/Fake]

Domain-B -> Generator-A -> Discriminator-A -> [Real/Fake]

Domain-B -> Discriminator-B -> [Real/Fake]

Domain-A -> Generator-B -> Discriminator-B -> [Real/Fake]

(#) Loss Functions

(##) Discriminator Real Loss

(##) Discriminator Fake Loss

(##) Generator X-Y Loss

(##) Generator Y-X Loss

(#) Training with Patch Discriminator

(##) Training without cycle consistency loss

| X-Y (1000 iterations) | Y-X (1000 iterations)|
|----------|---------|
| | |

(##) Training with cycle consistency loss

| X-Y (1000 iterations) | Y-X (1000 iterations)|
|----------|---------|
| | |

| X-Y (10000 iterations) | Y-X (10000 iterations)|
|----------|---------|
| | |

| X-Y (10000 iterations) | Y-X (10000 iterations)|
|----------|---------|
| | |

(##) Training with and without cycle consistency loss

| Without cycle consistency (10000 iterations) | With cycle consistency (10000 iterations)|
|----------|---------|
| | |
| | |

As seen above, without the cycle consistency loss the color information is transferred but the texture is all over the place. The resulting translations do not appear to be bijective: there is no one-to-one mapping between the two domains, and a single input image may map to several possible outputs. The cycle consistency loss counteracts this by requiring that translating an image to the other domain and back reproduces the original image (a code sketch is given at the end of this section).

(##) DC Discriminator vs Patch Discriminator

| DC Discriminator| Patch Discriminator|
|----------|---------|
| | |
| | |

A Patch Discriminator divides the input image into multiple patches and classifies each patch as real or fake, so its output is a matrix of scores, with each element representing the classification score for a particular patch of the input image. By discriminating patch by patch, it can better localize the differences between real and fake images, giving the generator more detailed, fine-grained feedback, which can lead to better image quality, at the cost of extra computation and memory. In contrast, a DC Discriminator is a standard discriminator network that takes the entire input image and outputs a single classification score. This approach is less fine-grained, since the whole image is classified as real or fake at once, but it requires fewer resources.
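To make the cycle consistency loss discussed above concrete, here is a minimal sketch. `G_XtoY` and `G_YtoX` stand for the two generators, and the L1 penalty and the weight `lambda_cycle` are standard CycleGAN choices rather than values taken from this report.

```python
import torch

def cycle_consistency_loss(G_XtoY, G_YtoX, real_X, real_Y, lambda_cycle=10.0):
    # Translate X -> Y -> X and Y -> X -> Y; both round trips should
    # reproduce the original images (an L1 penalty is used here).
    cycle_X = G_YtoX(G_XtoY(real_X))
    cycle_Y = G_XtoY(G_YtoX(real_Y))
    loss_X = torch.mean(torch.abs(cycle_X - real_X))
    loss_Y = torch.mean(torch.abs(cycle_Y - real_Y))
    return lambda_cycle * (loss_X + loss_Y)
```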
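The contrast between the two discriminators can also be shown in code. Below is a minimal sketch of a patch discriminator; the layer widths and counts are illustrative assumptions, not the exact architecture used in this assignment.

```python
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Fully convolutional discriminator: outputs a grid of scores, one per
    receptive-field patch, rather than one score for the whole image."""
    def __init__(self, in_channels=3, ndf=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, ndf, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf, ndf * 2, 4, stride=2, padding=1),
            nn.InstanceNorm2d(ndf * 2),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf * 2, ndf * 4, 4, stride=2, padding=1),
            nn.InstanceNorm2d(ndf * 4),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf * 4, 1, 4, stride=1, padding=1),  # no global pooling
        )

    def forward(self, x):
        # For a 64x64 input this yields a 7x7 grid of patch scores.
        return self.net(x)
```

Because the network is fully convolutional and never pools down to a single value, the least-squares losses from earlier can be applied directly to the score map, penalizing each patch independently.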
(#) Bells and Whistles

(##) Denoising Diffusion Probabilistic Model

I trained a denoising diffusion model to denoise and generate cat images. Training starts by adding noise to an image during the forward diffusion process; the model then learns to remove the noise during the backward (reverse) diffusion process. To denoise the image, I used a U-Net architecture, as shown above.

| Iterations | Images |
|-------|------|
| Iteration 0 | |
| Iteration 100 | |
| Iteration 250 | |
| Iteration 350 | |
| Iteration 500 | |

(##) Diffusion model on my own images

I found cat images on the internet similar to those in the dataset and used an off-the-shelf implementation of DreamBooth to generate the following.

**Cat-Image 1**

**Cat-Image 2**
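As a reference for the forward (noising) process described in the DDPM section above, here is a minimal sketch under standard DDPM assumptions; the linear beta schedule and the 500-step count are assumptions for illustration.

```python
import torch

T_steps = 500  # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T_steps)     # linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)  # cumulative noise-retention products

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0): the forward process in closed form."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```

The U-Net is then trained to predict the added noise from x_t and t; sampling reverses this process step by step, starting from pure Gaussian noise and ending at a generated cat image.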