Assignment #3 - Cats Generator Playground

Part 1: Deep Convolutional GAN

Implement the Discriminator of the DCGAN

1. Suppose the input image size is X (a multiple of 2), the kernel width is K, the padding size is P, and the stride is S. The output size is:

\[ \text{Output Size} = \frac{(X - K + 2 \times P)}{S} + 1 \]

Therefore, if K = 4 and S = 2 and we want the output size to be X/2, we can solve for the padding size and obtain P = 1.
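Plugging K = 4 and S = 2 into the formula and setting the output size to X/2 gives

\[ \frac{X - 4 + 2P}{2} + 1 = \frac{X}{2} \;\Rightarrow\; 2P - 2 = 0 \;\Rightarrow\; P = 1. \]

As a quick sanity check, the following sketch (assuming a standard PyTorch nn.Conv2d layer) confirms that this configuration halves the spatial resolution:

```python
import torch
import torch.nn as nn

# One downsampling block of the DCGAN discriminator: K=4, S=2, P=1 halves the spatial size.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 3, 64, 64)   # input with X = 64
y = conv(x)
print(y.shape)                  # torch.Size([1, 64, 32, 32]) -> output size X/2 = 32
```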

Experiment with DCGANs

1. Training loss curves:

Discriminator loss curve of vanilla data preprocessing.
Generator loss curve of vanilla data preprocessing.
Discriminator loss curve of vanilla & differentiable augmentation data preprocessing.
Generator loss curve of vanilla & differentiable augmentation data preprocessing.

If the GAN trains successfully, the discriminator loss should drop sharply in the early stages, because the discriminator is trained first and can easily learn to distinguish real images from the noise-like generator outputs. It may then rise slightly as the generator learns to produce images closer to the real ones. Eventually the loss should stabilize; ideally, the discriminator ends up making random guesses on any image. The generator loss should decrease as the generator learns to produce realistic fake images, and it should likewise stabilize over time.
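For reference, a minimal sketch of one DCGAN training step is shown below, assuming the standard non-saturating BCE losses (the actual assignment code may use a different loss variant, e.g. least squares); the network, optimizer, and data names are illustrative:

```python
import torch
import torch.nn.functional as F

def dcgan_step(D, G, real_images, d_optimizer, g_optimizer, noise_dim=100):
    """One training step with non-saturating BCE GAN losses (illustrative sketch)."""
    batch_size = real_images.size(0)
    device = real_images.device
    ones = torch.ones(batch_size, device=device)
    zeros = torch.zeros(batch_size, device=device)

    # Discriminator update: classify real images as 1 and generated images as 0.
    z = torch.randn(batch_size, noise_dim, 1, 1, device=device)
    fake_images = G(z).detach()
    d_loss = F.binary_cross_entropy_with_logits(D(real_images).view(-1), ones) + \
             F.binary_cross_entropy_with_logits(D(fake_images).view(-1), zeros)
    d_optimizer.zero_grad()
    d_loss.backward()
    d_optimizer.step()

    # Generator update: try to fool the discriminator into predicting 1 on fakes.
    z = torch.randn(batch_size, noise_dim, 1, 1, device=device)
    g_loss = F.binary_cross_entropy_with_logits(D(G(z)).view(-1), ones)
    g_optimizer.zero_grad()
    g_loss.backward()
    g_optimizer.step()

    return d_loss.item(), g_loss.item()
```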

2. Data preprocessing:

The following samples are produced by DCGAN with vanilla preprocessing and differentiable augmentation at 400 iterations and 6400 iterations.

Iteration 400.
Iteration 6400.

We find that the images produced by the DCGAN trained for 6400 iterations have higher quality than those produced at 400 iterations. Specifically, as training goes on, the cats' faces become more identifiable and clearer, and there is less noise.

The following samples compare training with and without differentiable augmentation at 6400 iterations.

Without differentiable augmentation.
With differentiable augmentation.

We find that images produced using differentiable augmentation show clearer cat faces. This is because vanilla data augmentation only operates on the real images, while differentiable augmentation perturbs both the real and the generated images, which lets the discriminator distinguish real from fake images under the same image transformations.
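To make this concrete, below is a minimal sketch of the idea, assuming a PyTorch setup; the simple brightness and translation transforms are illustrative stand-ins for the actual DiffAugment policies:

```python
import torch
import torch.nn.functional as F

def diff_augment(x):
    """Differentiable augmentations (illustrative): random brightness and translation."""
    # Random brightness shift; differentiable with respect to x.
    x = x + (torch.rand(x.size(0), 1, 1, 1, device=x.device) - 0.5)
    # Random translation by up to 1/8 of the image size via padding, rolling, and cropping.
    shift = x.size(2) // 8
    dx = int(torch.randint(-shift, shift + 1, (1,)))
    dy = int(torch.randint(-shift, shift + 1, (1,)))
    x = F.pad(x, [shift] * 4)
    x = torch.roll(x, shifts=(dy, dx), dims=(2, 3))
    return x[:, :, shift:-shift, shift:-shift]

# The key point: the SAME differentiable transform is applied to both real and fake
# images before the discriminator sees them, in both the D update and the G update:
#   d_real_logits = D(diff_augment(real_images))
#   d_fake_logits = D(diff_augment(G(z)))
```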

Part 2: Diffusion Model

The following samples are produced by training Diffusion Models.

Performance of DDPM vs. DCGAN

1. The quality of images generated by Diffusion Models is better than that of images generated by DCGAN, as Diffusion Models synthesize images with clearer structures and more details.

2. Diffusion Models suffer from slower inference than DCGAN, but can sample more diverse images.

3. Diffusion Models sample more diverse and higher-quality images, but suffer from slow inference. DCGAN has faster inference, but suffers from mode collapse; additionally, it is hard to train both the discriminator and the generator well.

Generate samples using a pre-trained diffusion model

I use a pre-trained Stable Diffusion v1.4 model to generate the following samples:

A Russian blue cat.
A grumpified cat.
An apple and an orange.
An orange.

Interestingly, I find that the pre-trained Stable Diffusion model sometimes does not follow the prompt well: it generates only an apple when I ask it to generate an apple and an orange.
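For reference, samples like the ones above can be generated with a few lines of code, assuming the Hugging Face diffusers library and the CompVis/stable-diffusion-v1-4 checkpoint (the exact generation settings used for my samples may differ):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pre-trained Stable Diffusion v1.4 pipeline (assumed checkpoint name).
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompts = [
    "A Russian blue cat",
    "A grumpified cat",
    "An apple and an orange",
    "An orange",
]

for prompt in prompts:
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
    image.save(prompt.replace(" ", "_") + ".png")
```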

Part 3: CycleGAN

The following samples are produced by CycleGAN without consistency loss at 1000 iterations.

Russian blue -> Grumpy.
Grumpy -> Russian blue.

The following samples are produced by CycleGAN with consistency loss at 1000 iterations.

Russian blue -> Grumpy.
Grumpy -> Russian blue.

The following samples are produced by CycleGAN without consistency loss at 10000 iterations.

Russian blue -> Grumpy.
Grumpy -> Russian blue.
Apple -> Orange.
Orange -> Apple.

The following samples are produced by CycleGAN with consistency loss at 10000 iterations.

Russian blue -> Grumpy.
Grumpy -> Russian blue.
Apple -> Orange.
Orange -> Apple.

With the cycle consistency loss, the generated images preserve the structure and shape of the source images. This is because requiring the generated target-domain images to be translated back to the original source-domain images constrains the generator to keep a structure similar to the original, which also eases network optimization. For example, without the consistency loss, the cats' faces and the outlines of the apples and oranges show structural disruptions, especially when there are multiple apples or oranges in the source image.
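As a concrete reference, a minimal sketch of the cycle consistency term is shown below, assuming PyTorch generators G_XtoY and G_YtoX as in the assignment's naming; the weighting and reduction are illustrative:

```python
import torch

def cycle_consistency_loss(G_XtoY, G_YtoX, images_X, images_Y, lambda_cycle):
    """L1 cycle consistency in both directions (illustrative sketch)."""
    # X -> Y -> X: translated images must map back to the original source images.
    reconstructed_X = G_YtoX(G_XtoY(images_X))
    loss_X = torch.mean(torch.abs(reconstructed_X - images_X))
    # Y -> X -> Y
    reconstructed_Y = G_XtoY(G_YtoX(images_Y))
    loss_Y = torch.mean(torch.abs(reconstructed_Y - images_Y))
    return lambda_cycle * (loss_X + loss_Y)
```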

Improvement to the loss for CycleGAN

Besides the pixel-wise intensity difference between the reconstructed source-domain image and the real source-domain image, we can also add an LPIPS loss term to the cycle consistency loss. The LPIPS loss encourages similarity between a pair of images in the feature space of a VGG network. The improved cycle-consistency loss for the X -> Y -> X cycle can be formulated as follows:

\[ L = \lambda_{\text{cycle}} \, L_1\big(\text{G\_YtoX}(\text{fake\_Y}),\ \text{images\_X}\big) + \lambda_{\text{lpips}} \, \text{LPIPS}_{\text{VGG}}\big(\text{G\_YtoX}(\text{fake\_Y}),\ \text{images\_X}\big) \]

where lambda_cycle and lambda_lpips are set to 1 and 10, respectively. The following samples compare training with and without the LPIPS loss (an implementation sketch is given after the comparisons).

Russian blue -> Grumpy. Without LPIPS loss.
Russian blue -> Grumpy. With LPIPS loss.
Grumpy -> Russian blue. Without LPIPS loss.
Grumpy -> Russian blue. With LPIPS loss.
Apple -> Orange. Without LPIPS loss.
Apple -> Orange. With LPIPS loss.
Orange -> Apple. Without LPIPS loss.
Orange -> Apple. With LPIPS loss.

As we can see, incorporating the LPIPS loss further preserves the structure and generates images with finer details.
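The combined loss above can be implemented roughly as follows, assuming the lpips package (richzhang/PerceptualSimilarity) with a VGG backbone and the assignment's G_YtoX / fake_Y / images_X naming; this is an illustrative sketch rather than the exact code:

```python
import torch
import lpips

# LPIPS perceptual distance computed in VGG feature space (assumes the `lpips` package).
lpips_fn = lpips.LPIPS(net="vgg")

def improved_cycle_loss(G_YtoX, fake_Y, images_X, lambda_cycle=1.0, lambda_lpips=10.0):
    """Cycle consistency = L1 pixel term + LPIPS perceptual term (illustrative sketch)."""
    reconstructed_X = G_YtoX(fake_Y)
    l1_term = torch.mean(torch.abs(reconstructed_X - images_X))
    # lpips expects images scaled to [-1, 1] and returns one distance per image.
    lpips_term = lpips_fn(reconstructed_X, images_X).mean()
    return lambda_cycle * l1_term + lambda_lpips * lpips_term
```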

DCDiscriminator vs. PatchDiscriminator

The following samples compare training with the DCDiscriminator against training with the PatchDiscriminator.
Russian blue -> Grumpy. DCDiscriminator.
Russian blue -> Grumpy. PatchDiscriminator.
Grumpy -> Russian blue. DCDiscriminator.
Grumpy -> Russian blue. PatchDiscriminator.
Apple -> Orange. DCDiscriminator.
Apple -> Orange. PatchDiscriminator.
Orange -> Apple. DCDiscriminator.
Orange -> Apple. PatchDiscriminator.

Compared with the images generated when training with the PatchDiscriminator, the images generated when training with the DCDiscriminator appear blurrier. This is because the PatchDiscriminator judges real versus fake at the patch level rather than at the global level, which pushes the generator to preserve fine-grained details in each local patch.
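To illustrate the difference, below is a minimal PatchGAN-style discriminator sketch, assuming 64x64 inputs; unlike the DCDiscriminator, which outputs a single real/fake score per image, it outputs a grid of scores, one per local patch (the layer sizes are illustrative, not the assignment's exact architecture):

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Outputs an N x N grid of real/fake logits, one per local image patch (sketch)."""
    def __init__(self, in_channels=3, base_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, base_channels, 4, stride=2, padding=1),            # 64 -> 32
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base_channels, base_channels * 2, 4, stride=2, padding=1),      # 32 -> 16
            nn.InstanceNorm2d(base_channels * 2),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base_channels * 2, base_channels * 4, 4, stride=2, padding=1),  # 16 -> 8
            nn.InstanceNorm2d(base_channels * 4),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base_channels * 4, 1, 4, stride=1, padding=1),                  # 8 -> 7x7 logits
        )

    def forward(self, x):
        return self.net(x)  # shape (B, 1, 7, 7): one logit per patch

# Shape check: PatchDiscriminator()(torch.randn(1, 3, 64, 64)).shape == (1, 1, 7, 7)
```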

Image Generation using Flow Matching

Flow Matching builds a probability path (p_t), 0 ≤ t ≤ 1, from a known source distribution p_0 = p to the target data distribution p_1 = q, where each p_t is a distribution over R^d. After training, we generate a novel sample from the target distribution X_1 ~ q by (i) drawing a novel sample from the source distribution X_0 ~ p, and (ii) solving the ODE determined by the learned velocity field. We aim to solve the following ODE:

\[ \frac{d\,\phi(t, x)}{dt} = u\big(t, \phi(t, x)\big) \]

Here u is a time-dependent velocity field that determines a time-dependent flow φ with φ(0, x) = x. To learn a linear path, i.e., φ(t, X_0) = X_t = t X_1 + (1 - t) X_0, the objective function can be formulated as follows:

\[ L = \mathbb{E}_{t, X_0, X_1}\,\big\| u^{\theta}(t, X_t) - (X_1 - X_0) \big\|^2 \]

During sampling, we first draw a point from the source distribution p, then integrate the ODE over a series of time steps t from 0 to 1.
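A minimal sketch of this training objective and a simple Euler-step sampler is shown below, assuming a PyTorch velocity network u_theta(t, x) that takes the time step and the current sample (an illustrative sketch, not the exact code used for the results):

```python
import torch

def flow_matching_loss(u_theta, x1):
    """Flow matching loss for the linear path X_t = t*X_1 + (1 - t)*X_0 (sketch)."""
    x0 = torch.randn_like(x1)                                   # X_0 ~ p (Gaussian source)
    t = torch.rand(x1.size(0), device=x1.device).view(-1, 1, 1, 1)
    xt = t * x1 + (1 - t) * x0                                  # point on the linear path
    target = x1 - x0                                            # velocity of the linear path
    return ((u_theta(t, xt) - target) ** 2).mean()

@torch.no_grad()
def sample(u_theta, shape, num_steps=100, device="cuda"):
    """Integrate dx/dt = u_theta(t, x) from t = 0 to t = 1 with Euler steps."""
    x = torch.randn(shape, device=device)                       # start from X_0 ~ p
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0], 1, 1, 1), i * dt, device=device)
        x = x + dt * u_theta(t, x)
    return x                                                    # approximate sample X_1 ~ q
```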

I train using the code here on our data for 1000 epochs. The generated samples are shown below.

Grumpy cats.
Russian blue cats.