Zijie Li (zijieli@andrew.cmu.edu)
Tianqin Li (audit, tianqinl@andrew.cmu.edu)
Learning useful representations without supervision is an interesting topic and remains a key challenge in machine learning. VQ-VAE [1] leverages the Variational Auto-Encoder to learn powerful discrete representations in the latent space from the training data, and alleviates the posterior collapse issue that affects many VAE variants. Motivated by neural discrete representation learning, we explore and propose a new mechanism for learning discrete representations that improves image synthesis in the VAE framework. Our method differs from VQ-VAE in two key ways. First, we do not maintain the dictionary in the latent space; instead, we extract information from a learned dictionary in the middle of the image generator/decoder. Second, we use attention to retrieve embeddings from the learned dictionary instead of vector quantisation.
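A minimal sketch of this attention-based lookup is shown below. The scaled dot-product form, the absence of query/key projections, and names such as `MemoryBankAttention` and `bank_size` are our illustrative assumptions, not the exact module definition:

```python
import torch
import torch.nn as nn

class MemoryBankAttention(nn.Module):
    """Illustrative attention lookup into a learned dictionary.

    The decoder feature map queries the bank; a softmax over all embeddings
    replaces the hard nearest-neighbour assignment used in VQ-VAE.
    """
    def __init__(self, feat_dim=512, bank_size=512):
        super().__init__()
        # Learned dictionary of vector embeddings (trained end-to-end here).
        self.bank = nn.Parameter(torch.randn(bank_size, feat_dim))

    def forward(self, f):
        # f: (B, C, H, W) feature map from the middle of the decoder.
        B, C, H, W = f.shape
        q = f.flatten(2).transpose(1, 2)                 # (B, H*W, C) queries
        attn = torch.softmax(q @ self.bank.t() / C ** 0.5, dim=-1)  # (B, H*W, bank_size)
        out = attn @ self.bank                           # (B, H*W, C) retrieved embeddings
        return out.transpose(1, 2).reshape(B, C, H, W)
```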
Using backpropagation to directly update the vector embeddings inside the memory bank has two downsides. First, we have to store the computation graph of the attention over the memory bank, which makes training more memory-intensive (especially for a larger memory bank). Second, the vector embeddings inside the memory bank (initialized randomly) may not train as fast as the generator. Hence, we also test another update mechanism, where we stop the gradient before the attention and update the vector embeddings with momentum. Given a feature map \( \mathbf{f}\), we find the closest vector embedding \(\boldsymbol{\theta}_i\) in the memory bank and update it as: $$\boldsymbol{\theta}^{'}_i \leftarrow (1-m)\boldsymbol{\theta}_i + m\mathbf{f}$$
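A minimal sketch of this stop-gradient momentum update, assuming the features are pooled to one vector per sample and the nearest embedding is chosen by L2 distance (both our assumptions):

```python
import torch

@torch.no_grad()  # stop-gradient: the bank is never part of the backward graph
def momentum_update(bank, f, m=0.01):
    """Update the closest embedding toward each feature with momentum.

    bank: (N, C) memory bank; f: (B, C) pooled feature vectors.
    Implements theta_i' <- (1 - m) * theta_i + m * f for the nearest theta_i.
    """
    # Index of the nearest embedding for each feature (L2 distance).
    idx = torch.cdist(f, bank).argmin(dim=1)  # (B,)
    # Duplicate indices keep the last write; acceptable for a sketch.
    bank[idx] = (1 - m) * bank[idx] + m * f
```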
We implement our method under the VAE-GAN [3] framework, with an encoder, a decoder/generator, and a discriminator. The encoder is a ResNet18, and we use BigGAN [4] as the main architecture of the decoder and the discriminator, with the non-local block replaced by our memory bank. Our BigGAN implementation is based on the official PyTorch implementation: https://github.com/ajbrock/BigGAN-PyTorch. We train and evaluate three variants of our method: a baseline model with exactly the same generator as BigGAN, a vanilla version of the memory bank without clustering or momentum update, and a version with both the clustering mechanism and the momentum update. For the vanilla memory bank, the bank holds 512 embeddings of dimension 512. For the clustered version, the bank contains 20 clusters with 100 vectors each. We train the model with a combination of a GAN loss and a VAE loss, as sketched below: the GAN loss is the same hinge loss as in the original BigGAN, and the VAE loss comprises an L1 reconstruction loss and the standard ELBO. We train all models for 100k iterations on the CelebA dataset at 128 x 128 resolution. (Note: as the BigGAN authors suggest, BigGANs usually reach their optimal performance after 150k ~ 200k iterations, but we could not afford such a long training period in this project.) We use the Fréchet Inception Distance (based on pytorch-fid: https://github.com/mseitzer/pytorch-fid) to evaluate the generated images, computing FID over 50k generated images for each task.
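A minimal sketch of this combined objective, assuming a Gaussian posterior with the usual KL term of the ELBO; the weights `lambda_rec` and `lambda_kl` are illustrative, not the values we used:

```python
import torch
import torch.nn.functional as F

def d_hinge_loss(d_real, d_fake):
    # Hinge loss for the discriminator, as in BigGAN.
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def g_loss(d_fake, x, x_recon, mu, logvar, lambda_rec=1.0, lambda_kl=1e-2):
    # Generator/decoder objective: hinge GAN term plus the VAE terms
    # (L1 reconstruction and the KL term of the standard ELBO).
    gan = -d_fake.mean()
    rec = F.l1_loss(x_recon, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return gan + lambda_rec * rec + lambda_kl * kl
```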
In this project we enhance the performance of the standard VAE framework by leveraging attention and dictionary learning. We propose an external attention mechanism and try two strategies to reduce its computational cost. Our studies show that maintaining an external memory bank to learn useful representations from the training data can enhance the overall performance and robustness of the model. In addition, we find that updating the vector embeddings inside the memory bank with gradients is the more effective way to learn the embeddings.
[1] Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu: Neural Discrete Representation Learning, https://arxiv.org/abs/1711.00937
[2] Aurko Roy, Mohammad Saffar, Ashish Vaswani, David Grangier: Efficient Content-Based Sparse Attention with Routing Transformers, https://arxiv.org/abs/2003.05997
[3] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, Ole Winther: Autoencoding beyond pixels using a learned similarity metric, https://arxiv.org/abs/1512.09300
[4] Andrew Brock, Jeff Donahue, Karen Simonyan: Large Scale GAN Training for High Fidelity Natural Image Synthesis, https://arxiv.org/abs/1809.11096