Zijie Li (zijieli@andrew.cmu.edu)
Tianqin Li (audit, tianqinl@andrew.cmu.edu)
Learning useful representations without supervision is an interesting topic and remains a key challenge in machine learning. VQ-VAE [1] leverages the Variational Auto-Encoder to learn powerful discrete representations in the latent space from the training data, and alleviates the posterior collapse issue that affects many VAE variants. Motivated by neural discrete representation learning, we explore and propose a new mechanism for learning discrete representations that improves image synthesis in the VAE framework. Our method differs from VQ-VAE in two key ways. First, we do not maintain the dictionary in the latent space; instead, we extract information from a learned dictionary in the middle of the image generator/decoder. Second, we use attention to retrieve embeddings from the learned dictionary instead of vector quantisation.
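A minimal sketch of this attention-based lookup is shown below. The scaled dot-product form, the absence of query/key projections, and names such as `MemoryBankAttention` and `bank_size` are our illustrative assumptions, not the exact module definition:

```python
import torch
import torch.nn as nn

class MemoryBankAttention(nn.Module):
    """Illustrative attention lookup into a learned dictionary.

    The decoder feature map queries the bank; a softmax over all embeddings
    replaces the hard nearest-neighbour assignment used in VQ-VAE.
    """
    def __init__(self, feat_dim=512, bank_size=512):
        super().__init__()
        # Learned dictionary of vector embeddings (trained end-to-end here).
        self.bank = nn.Parameter(torch.randn(bank_size, feat_dim))

    def forward(self, f):
        # f: (B, C, H, W) feature map from the middle of the decoder.
        B, C, H, W = f.shape
        q = f.flatten(2).transpose(1, 2)                 # (B, H*W, C) queries
        attn = torch.softmax(q @ self.bank.t() / C ** 0.5, dim=-1)  # (B, H*W, bank_size)
        out = attn @ self.bank                           # (B, H*W, C) retrieved embeddings
        return out.transpose(1, 2).reshape(B, C, H, W)
```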
Using backpropagation to directly update the vector embeddings inside the memory bank has two downsides. First, we have to store the computation graph of the attention over the memory bank, which makes training more memory-intensive (especially for a larger memory bank). Second, the vector embeddings inside the memory bank (initialized randomly) may not train as fast as the generator. Hence, we also test another update mechanism, where we stop the gradient before the attention and update the vector embeddings with momentum. Given a feature map \( \mathbf{f}\), we find the closest vector embedding \(\boldsymbol{\theta}_i\) in the memory bank and update it as: $$\boldsymbol{\theta}^{'}_i \leftarrow (1-m)\boldsymbol{\theta}_i + m\mathbf{f}$$
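A minimal sketch of this stop-gradient momentum update, assuming the features are pooled to one vector per sample and the nearest embedding is chosen by L2 distance (both our assumptions):

```python
import torch

@torch.no_grad()  # stop-gradient: the bank is never part of the backward graph
def momentum_update(bank, f, m=0.01):
    """Update the closest embedding toward each feature with momentum.

    bank: (N, C) memory bank; f: (B, C) pooled feature vectors.
    Implements theta_i' <- (1 - m) * theta_i + m * f for the nearest theta_i.
    """
    # Index of the nearest embedding for each feature (L2 distance).
    idx = torch.cdist(f, bank).argmin(dim=1)  # (B,)
    # Duplicate indices keep the last write; acceptable for a sketch.
    bank[idx] = (1 - m) * bank[idx] + m * f
```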
We implement our method under the VAE-GAN [3] framework, with an encoder, a decoder/generator, and a discriminator. The encoder is a ResNet18, and we use BigGAN [4] as the main architecture of the decoder and the discriminator, with the non-local block replaced by our memory bank. Our BigGAN implementation is based on the official PyTorch implementation: https://github.com/ajbrock/BigGAN-PyTorch. We train and evaluate three variants of our method: a baseline model with exactly the same generator as BigGAN, a vanilla version of the memory bank without clustering or momentum update, and a version with both the clustering mechanism and the momentum update. For the vanilla memory bank, the bank holds 512 embeddings of dimension 512. For the clustered version, the bank contains 20 clusters with 100 vectors each. We train the model with a combination of a GAN loss and a VAE loss, as sketched below: the GAN loss is the same hinge loss as in the original BigGAN, and the VAE loss comprises an L1 reconstruction loss and the standard ELBO. We train all models for 100k iterations on the CelebA dataset at 128 x 128 resolution. (Note: as the BigGAN authors suggest, BigGANs usually reach their optimal performance after 150k ~ 200k iterations, but we could not afford such a long training period in this project.) We use the Fréchet Inception Distance (based on pytorch-fid: https://github.com/mseitzer/pytorch-fid) to evaluate the generated images, computing FID over 50k generated images for each task.
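A minimal sketch of this combined objective, assuming a Gaussian posterior with the usual KL term of the ELBO; the weights `lambda_rec` and `lambda_kl` are illustrative, not the values we used:

```python
import torch
import torch.nn.functional as F

def d_hinge_loss(d_real, d_fake):
    # Hinge loss for the discriminator, as in BigGAN.
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def g_loss(d_fake, x, x_recon, mu, logvar, lambda_rec=1.0, lambda_kl=1e-2):
    # Generator/decoder objective: hinge GAN term plus the VAE terms
    # (L1 reconstruction and the KL term of the standard ELBO).
    gan = -d_fake.mean()
    rec = F.l1_loss(x_recon, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return gan + lambda_rec * rec + lambda_kl * kl
```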
In this project we enhance the performance of the standard VAE framework by leveraging attention and dictionary learning. We propose an external attention mechanism and try two strategies to reduce its computational cost. Our studies show that maintaining an external memory bank to learn useful representations from the training data can enhance the overall performance and robustness of the model. In addition, we find that updating the vector embeddings inside the memory bank with gradients is the more effective way to learn the embeddings.
[1] Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu: Neural Discrete Representation Learning, https://arxiv.org/abs/1711.00937
[2] Aurko Roy, Mohammad Saffar, Ashish Vaswani, David Grangier: Efficient Content-Based Sparse Attention with Routing Transformers, https://arxiv.org/abs/2003.05997
[3] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, Ole Winther: Autoencoding beyond pixels using a learned similarity metric, https://arxiv.org/abs/1512.09300
[4] Andrew Brock, Jeff Donahue, Karen Simonyan: Large Scale GAN Training for High Fidelity Natural Image Synthesis, https://arxiv.org/abs/1809.11096