To compute the necessary padding for the convolutional layers in the DCGAN discriminator:
O = (X - F + 2B) / S + 1
where O is the output size, X is the input size, F is the kernel (filter) size, B is the padding, and S is the stride.
Substituting F = 4 and S = 2, and setting O = X/2, we solve for B:
X/2 = (X - 4 + 2B)/2 + 1, which simplifies to 2B = 2, so B = 1.
Thus, to ensure proper downsampling, each convolutional layer should apply a padding value of 1.
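As a quick sanity check, here is a minimal sketch (a standard PyTorch Conv2d, not necessarily the exact layer definition used in these experiments) confirming that kernel size 4, stride 2, and padding 1 halve the spatial resolution:

```python
import torch
import torch.nn as nn

# With F=4, S=2, B=1, a convolution halves the spatial size, e.g. 64x64 -> 32x32.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=4, stride=2, padding=1)
x = torch.randn(1, 3, 64, 64)
print(conv(x).shape)  # torch.Size([1, 64, 32, 32]), since (64 - 4 + 2*1)/2 + 1 = 32
```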
During GAN training, the discriminator's total loss approaches 0, while the generator's total loss approaches 1. A discriminator loss near 0 means the discriminator separates real and generated images almost perfectly, and a generator loss near 1 (its worst value under a least-squares objective, where the generator's target is 1) means the generator is failing to fool it. Ideally, training would instead settle near an equilibrium where the discriminator cannot reliably tell real from generated images and both losses plateau away from these extremes.
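To make this interpretation concrete, here is a minimal sketch of least-squares GAN losses (the objective is an assumption on my part, and the names D, G, real, and z are placeholders, not the code used in these experiments):

```python
import torch

def d_loss_fn(D, G, real, z):
    fake = G(z).detach()  # do not backpropagate into the generator
    # Discriminator targets: 1 for real images, 0 for fakes.
    return 0.5 * ((D(real) - 1) ** 2).mean() + 0.5 * (D(fake) ** 2).mean()

def g_loss_fn(D, G, z):
    # Generator target: make the discriminator output 1 on fakes.
    return ((D(G(z)) - 1) ** 2).mean()

# If the discriminator wins completely (D(real)=1, D(fake)=0), then
# d_loss -> 0 and g_loss -> 1, matching the curves described above.
```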
Figure: Discriminator and generator loss curves for the Vanilla model, with and without diffaug.
Figure: Samples 1-4 from the Vanilla model, with and without diffaug, at 800 and 6400 iterations.
DCGAN is a type of GAN that consists of a generator and a discriminator engaged in a min-max optimization game: the generator learns to produce realistic samples, while the discriminator learns to distinguish real samples from fake ones. DDPM, in contrast, takes a probabilistic approach: during training it gradually adds noise to an image and learns to reverse this process to generate high-quality samples.
From my experiments, I observe that DCGAN suffers from training instability caused by the adversarial nature of GANs. DDPM, on the other hand, is more stable, since it optimizes a well-defined likelihood objective, which yields consistent training and better sample quality and diversity.
In terms of computational requirements, DCGAN is lightweight and can be trained relatively quickly, whereas DDPM is computationally expensive, requiring a large number of denoising steps during both training and sampling.
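As a rough illustration of the process DDPM learns to reverse, here is a minimal sketch of the closed-form forward noising step (the schedule values and function names are illustrative assumptions, not code from these experiments):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule (common default)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product \bar{alpha}_t

def noisy_sample(x0, t):
    """q(x_t | x_0): apply t steps of Gaussian noise in a single closed-form step."""
    eps = torch.randn_like(x0)
    a = alphas_bar[t].sqrt().view(-1, 1, 1, 1)
    s = (1.0 - alphas_bar[t]).sqrt().view(-1, 1, 1, 1)
    return a * x0 + s * eps, eps

# Training: the network is optimized to predict eps from x_t,
# i.e. loss = ||eps - eps_theta(x_t, t)||^2; sampling reverses the chain step by step.
x0 = torch.randn(4, 3, 32, 32)
t = torch.randint(0, T, (4,))
xt, eps = noisy_sample(x0, t)
```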
Figure: CycleGAN translation results in both directions (X→Y and Y→X).
When CycleGAN is trained with cycle consistency loss, the generated images maintain structural integrity, meaning that when an image is translated to the target domain and back to the source domain, it closely resembles the original. This results in translations that look realistic, with well-preserved details, textures, and content alignment across domains.
In contrast, training without cycle consistency loss leads to visually inconsistent and unstable translations, including severe distortions, loss of important details, and unnatural artifacts in the generated images.
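For reference, the cycle-consistency term that enforces this round-trip behavior can be sketched as follows (the generator names and the weight are illustrative, not the report's actual code):

```python
import torch

def cycle_consistency_loss(G_XtoY, G_YtoX, real_X, real_Y, lambda_cyc=10.0):
    # Translate to the other domain and back; the reconstruction should match
    # the original image. The L1 distance keeps translations structure-preserving.
    rec_X = G_YtoX(G_XtoY(real_X))
    rec_Y = G_XtoY(G_YtoX(real_Y))
    return lambda_cyc * (torch.abs(rec_X - real_X).mean()
                         + torch.abs(rec_Y - real_Y).mean())
```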
Model Version: Stable Diffusion v1-4 from CompVis
Steps: Set to 50. This parameter controls how many denoising iterations the model performs to progressively refine the generated image. Higher values typically improve image quality but increase computation time; around 50 inference steps is the commonly recommended default for Stable Diffusion, providing a good balance between quality and computational efficiency.
Guidance Scale: Set to 7.5. The guidance scale (also known as the CFG scale) determines how strictly the generated images follow the provided text prompt. Values around 7 to 7.5 are commonly used, offering a good balance between prompt adherence and creative flexibility.
Image Resolution: Defaulted to 512x512 pixels, matching the resolution at which the model was fine-tuned.
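For reference, here is a sketch of how these settings map onto the diffusers API (the prompt shown is a hypothetical placeholder, not one of the prompts used for the samples below):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")  # assumes a CUDA-capable GPU is available

image = pipe(
    "a photo of a cat wearing a space suit",  # hypothetical example prompt
    num_inference_steps=50,   # denoising steps
    guidance_scale=7.5,       # CFG scale: prompt adherence vs. creative flexibility
    height=512, width=512,    # the model's native resolution
).images[0]
image.save("sample.png")
```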
Figure: Samples 1-4 generated with these settings.
DCDiscriminator evaluates the entire image at once, making it effective at capturing global structure but often weak at enforcing local consistency. As a result, images generated with DCDiscriminator have smoother textures but lack fine detail, leading to blurry, oversimplified outputs.
In contrast, PatchDiscriminator divides an image into smaller patches and evaluates them independently, enforcing high-frequency details more effectively. This localized supervision helps capture textures and sharp edges more precisely, making the generated images more realistic.
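To make the contrast concrete, here is a rough sketch of a PatchGAN-style discriminator (the layer sizes are illustrative, not the exact architecture used here); note that it outputs a grid of per-patch logits rather than a single score:

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_channels=3, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, base, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base * 2),
            nn.LeakyReLU(0.2),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base * 4),
            nn.LeakyReLU(0.2),
            nn.Conv2d(base * 4, 1, 4, stride=1, padding=1),  # per-patch logits
        )

    def forward(self, x):
        # Each output location scores one overlapping patch of the input,
        # so the loss supervises local texture rather than a single global score.
        return self.net(x)

x = torch.randn(1, 3, 64, 64)
print(PatchDiscriminator()(x).shape)  # torch.Size([1, 1, 7, 7]): one score per patch
```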
Figure: Translation results in both directions (X→Y and Y→X) for the two discriminators.
Here are CycleGAN results on two additional datasets:
Figure: Monet painting ↔ photo translations (X: Monet painting, Y: photo; results shown in both directions, X→Y and Y→X).
Figure: Summer ↔ winter translations (X: summer, Y: winter; results shown in both directions, X→Y and Y→X).