11-747 Learning-based Image Synthesis George Cazenavette and Manuel Rodriguez Ladron de Guevara
Following MLP-Mixer (Tolstikhin et al., 2021), the recent MLP-based model released by Google in which traditional convolutional neural networks (CNNs) are replaced with simpler multi-layer perceptron (MLP) layers, we present MixerGAN, a generative model for unpaired image-to-image translation. MixerGAN replaces the core ResNet (He et al., 2015) blocks of the original CycleGAN model (Zhu et al., 2017) with simpler and more efficient MLP blocks, and demonstrates competitive performance on benchmark datasets for unpaired image-to-image translation.
The research community has been making great efforts to improve image-to-image translation models. Since the release of the Pix2pix (Isola et al., 2016) and CycleGAN (Zhu et al., 2017) papers, researchers have been looking at alternatives to CNNs to build better models. In language, the transformer model (Vaswani et al., 2017), based on self-attention mechanisms, quickly supplanted recurrent neural networks and became the de facto solution for language models.
Similarly, and while CNNs remain the go-to model for computer vision, researchers have been trying to complement or even replace CNNs with self-attention mechanisms. However, self-attention is highly inefficient for vision problems because of its quadratic complexity in the number of pixels, making it impractical for high-resolution inputs. To this end, we propose MixerGAN, an MLP-based model for unpaired image-to-image translation.
MixerGAN replaces the standard residual blocks found in complex generative models with simpler and more efficient MLP layers. We show that CNNs are not strictly necessary for a sound generative pipeline; in fact, some of our results on benchmark datasets are even better than those in the original CycleGAN paper. MixerGAN is trained in an unsupervised setting with unpaired data. Specifically, for this project we evaluate on the apple2orange, horse2zebra, and summer2winter datasets.
Generative Adversarial Networks (GANs), introduced by Goodfellow et al. in 2014, have made incredible progress over the years, becoming standard models for representation learning, style transfer, super-resolution, and image-to-image translation. Image-to-image translation is the task of learning a mapping between images from two different domains: given domains X and Y, we learn a mapping from X to Y and from Y to X. There are two broad categories depending on the availability of data. The first is learning a supervised mapping from paired images, assuming a dataset with such paired samples exists. When no paired data is available, the alternative is the perhaps more challenging unsupervised approach, as in Zhu et al.'s CycleGAN model.
Inspired by the success of attention mechanisms in language models, the vision community has made great efforts to bring attention mechanisms to vision tasks. Attention-based models have gained popularity in a variety of computer vision tasks, including image classification and image segmentation. Attention improves performance by encouraging the model to focus on the salient features of the input. A caveat of attention models in vision tasks is that attention activations are quadratic in the number of inputs (pixels), making them extremely computationally inefficient for high-resolution inputs.
Specific to image-to-image translation, there have been some attempts to bring such attention mechanisms to generative models. Not all of them interpret attention in the same way, so the related work below does not reflect a single rigid notion of attention.
AttentionGAN by Tang et al. (2020) interprets attention as a series of background, foreground, and content masks that help the generator focus on foreground objects while leaving the background scenery as in the original source. While this approach generates good images, it involves producing multiple sets of masks in a perhaps unnecessarily convoluted architecture. In a manner more faithful to the original attention mechanism, MMA-CycleGAN by Ji et al. (2020) proposes multi-head mutual attention, which allows long-range dependency modelling between two image domains and is claimed to achieve results similar to the original CycleGAN in less time. SPA-GAN by Emami et al. (2020) proposes a novel spatial attention mechanism in which the discriminator computes an attention map that is fed back into the generator to help it focus on the discriminative regions between the source and target domains.
These approaches introduce models that are either more complex or less efficient than the original CycleGAN. To this end, we present a simpler and more efficient model that considers long-distance relationships between pixels and performs better than prior work.
Our setup is the same as in CycleGAN by Zhu et al. That is, we want to learn mapping functions between two domains, X and Y. Following the original model, our setup consists of two generators, G_XtoY and G_YtoX, and two discriminators, D_X and D_Y. Each generator takes in an image from one domain and translates it to the other domain. Likewise, each discriminator learns to distinguish between real and translated images in its corresponding domain. We have two objectives, an adversarial loss and a cycle-consistency loss. We also experiment with a third loss, a perceptual loss, that helps achieve better reconstructions.
We use two losses throughout and ablate a third: an adversarial loss, a cycle-consistency loss, and a content loss.
We apply an adversarial loss to each mapping function. For a minibatch of samples $\{x^{(1)}, x^{(2)}, ..., x^{(m)}\}$ from domain X and $\{y^{(1)}, y^{(2)}, ..., y^{(n)}\}$ from domain Y, we compute the discriminator loss as the sum of the following two terms: $$ L_{real}^{D} = \frac{1}{m} \sum^m_{i=1} (D_X(x^{(i)}) - 1)^2 + \frac{1}{n} \sum^n_{j=1} (D_Y(y^{(j)}) - 1)^2$$ $$L_{fake}^{D} = \frac{1}{m} \sum^m_{i=1} (D_Y(G_{XtoY}(x^{(i)})))^2 + \frac{1}{n} \sum^n_{j=1} (D_X(G_{YtoX}(y^{(j)})))^2 $$
Likewise, the generator losses are computed as follows: $$ L_{G_{YtoX}}= \frac{1}{n} \sum^n_{j=1} (D_X(G_{YtoX}(y^{(j)})) - 1)^2 + L_{cycle}^{YtoXtoY}$$ $$ L_{G_{XtoY}}= \frac{1}{m} \sum^m_{i=1} (D_Y(G_{XtoY}(x^{(i)})) - 1)^2 + L_{cycle}^{XtoYtoX}$$ where $L_{cycle}$ is the L1 loss between the reconstructed image and the original image.
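As a concrete reference, the sketch below implements these objectives in PyTorch under the least-squares formulation written above; the cycle-consistency weight lambda_cycle is an assumption borrowed from CycleGAN's default and is not fixed by the equations.

```python
def discriminator_loss(D_X, D_Y, G_XtoY, G_YtoX, x, y):
    """L_real^D + L_fake^D for minibatches x (domain X) and y (domain Y),
    all given as torch tensors / nn.Modules."""
    loss_real = ((D_X(x) - 1) ** 2).mean() + ((D_Y(y) - 1) ** 2).mean()
    loss_fake = (D_Y(G_XtoY(x)) ** 2).mean() + (D_X(G_YtoX(y)) ** 2).mean()
    return loss_real + loss_fake

def generator_loss(D_X, D_Y, G_XtoY, G_YtoX, x, y, lambda_cycle=10.0):
    """Adversarial terms plus L1 cycle-consistency terms for both generators.
    lambda_cycle is an assumed weighting (CycleGAN's default of 10)."""
    fake_y, fake_x = G_XtoY(x), G_YtoX(y)
    adv = ((D_Y(fake_y) - 1) ** 2).mean() + ((D_X(fake_x) - 1) ** 2).mean()
    cycle = (G_YtoX(fake_y) - x).abs().mean() + (G_XtoY(fake_x) - y).abs().mean()
    return adv + lambda_cycle * cycle
```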
We introduce a third loss based on content reconstruction. The content loss measures the content distance between two images at a given layer. Denote the $l$-th layer features of the input image $x$ as $X^l$ and those of the target content image $c$ as $C^l$. The content loss is defined as the squared L2 distance between these two feature maps: $$L_{content}(x, c, l) = \frac{1}{2}\sum_{i,j}(X_{ij}^l - C_{ij}^l)^2$$ We can select the level within the VGG-19 network at which we extract features to represent content. Higher layers capture the high-level content of the image, while reconstructions from lower layers are nearly pixel-perfect.
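A minimal PyTorch sketch of this content loss is shown below, assuming features are taken from a frozen VGG-19 truncated at relu4_2; the layer choice is an assumption, not a detail fixed by the text.

```python
import torchvision

# Frozen VGG-19 feature extractor up to relu4_2 (index 22 of torchvision's
# vgg19().features). Newer torchvision versions take `weights=` instead of
# `pretrained=`.
vgg_features = torchvision.models.vgg19(pretrained=True).features[:23].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def content_loss(x, c):
    """Squared L2 distance between the VGG features of the generated image x
    and those of the content target c, as in the equation above."""
    return 0.5 * ((vgg_features(x) - vgg_features(c)) ** 2).sum()
```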
Our experiment substitutes MLP blocks for the convolutional blocks. While the model still contains some convolution layers for downsampling and upsampling, the core translation segment, 9 residual blocks in the original paper, each comprised of convolution layers, instance normalization, and ReLU activations, is replaced by 9 MLP-Mixer blocks. The idea behind the Mixer block is to separate the per-location (channel-mixing) operations from the cross-location (token-mixing) operations using MLPs.
The overall model architecture is shown in the figure above. The generator comprises the following parts: stem, downsample, projection, translation, de-projection, and upsample. We keep the original CycleGAN convolutional blocks in all but the translation step, where, instead of ResNet blocks, we use the MLP-Mixer block.
Each Mixer block takes a sequence of S patches, each projected to a hidden dimension C. This results in a 2-dimensional input table $X \in \mathbb{R}^{S \times C}$. The same projection matrix is used for all patches, and each MLP-Mixer block is comprised of two sub-blocks: a token-mixing block and a channel-mixing block. We follow Tolstikhin et al.'s architecture, in which layer normalization is applied to the input of each sub-block, followed by MLPs with GELU non-linearities and skip connections.
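To make this structure concrete, the following is a minimal PyTorch sketch of one Mixer block as described above, assuming the hidden dimension of each sub-MLP equals its input dimension; the actual implementation may use different expansion factors.

```python
import torch.nn as nn

class MlpSubBlock(nn.Module):
    """Two-layer MLP with GELU, as in Tolstikhin et al. The hidden dimension
    defaulting to the input dimension is an assumption."""
    def __init__(self, dim, hidden_dim=None):
        super().__init__()
        hidden_dim = hidden_dim or dim
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim)
        )

    def forward(self, x):
        return self.net(x)

class MixerBlock(nn.Module):
    """One translation block: token mixing across the S patches, then channel
    mixing across the C channels, each with a pre-LayerNorm and a skip
    connection. Input x has shape (batch, S, C)."""
    def __init__(self, num_tokens, num_channels):
        super().__init__()
        self.norm1 = nn.LayerNorm(num_channels)
        self.token_mlp = MlpSubBlock(num_tokens)
        self.norm2 = nn.LayerNorm(num_channels)
        self.channel_mlp = MlpSubBlock(num_channels)

    def forward(self, x):
        # Token mixing: transpose so the MLP acts across the token dimension.
        y = self.norm1(x).transpose(1, 2)          # (batch, C, S)
        x = x + self.token_mlp(y).transpose(1, 2)  # back to (batch, S, C)
        # Channel mixing: MLP acts across the channel dimension.
        x = x + self.channel_mlp(self.norm2(x))
        return x
```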
MLP-Mixer blocks are more efficient than ResNet blocks, with roughly 7x fewer parameters. We show that by correctly organizing channels and tokens, there is no need for convolutional layers in the translation step. For a 16x16 feature map with 512 channels, the Mixer block has 655,360 parameters, while a convolutional block has 4,718,592 parameters. With a correct downsampling process yielding a reasonably small feature map, the Mixer block remains more efficient than its convolutional counterpart.
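These counts can be reproduced with simple back-of-the-envelope arithmetic (biases ignored, MLP hidden dimensions assumed equal to their input dimensions):

```python
# One translation block on a 16x16 feature map with 512 channels.
tokens, channels, kernel = 16 * 16, 512, 3

# Vanilla residual block: two 3x3 convolutions with 512 channels each.
conv_block = 2 * kernel * kernel * channels * channels       # 4,718,592

# Mixer block: token-mixing MLP (two S x S layers) + channel-mixing MLP
# (two C x C layers).
mixer_block = 2 * tokens * tokens + 2 * channels * channels  # 655,360

print(conv_block / mixer_block)  # ~7.2x fewer parameters
```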
As this is a new model, we have so far tested only the effect of the translation mapping while keeping the discriminator architecture fixed; exploring the discriminator is left for the next steps of this project.
For this analysis, consider all layers to be "isotropic" such that their inputs and outputs are of the same dimensions.
As noted in the seminal work by Tolstikhin et al., the MLP-Mixer is, at its core, a convolutional neural network with very specific architectural hyper-parameters. As such, the MLP-Mixer can exploit existing GPU architectures and implementations that perform convolution operations with extreme efficiency, whereas attention-based networks are currently throttled by the speed at which GPUs can perform the un-optimized attention operation.
Furthermore, the MLP-Mixer and transformer blocks differ in their usage of memory. Both the transformer and the MLP-Mixer contain a channel-mixing MLP, so we focus on the transformer's self-attention operator and the Mixer's token-mixing MLP for comparison. For a representation with $n$ tokens and $c$ channels, the self-attention operator of the transformer block has $\mathcal{O}(c^2)$ parameters while the token-mixing MLP of the Mixer block has $\mathcal{O}(n^2)$ parameters. However, the main memory sink of the self-attention block comes from the intermediate activations required for back-propagation. For a batch size $b$, the self-attention module uses $\mathcal{O}(bn^2 + bnc)$ intermediate activation floats while the token-mixing MLP uses $\mathcal{O}(bnc)$. The additional $bn^2$ term in the memory usage of the self-attention module is what makes transformer models prohibitive for domains with a large number of tokens, such as visual data.
A vanilla residual block consisting of two convolutional layers with kernels of size $k\times k$ would only have $\mathcal{O}(k^2c^2)$ parameters and use $\mathcal{O}(bnc)$ intermediate activation floats. Clearly, a vanilla residual block has the least memory usage and parameter count, but it lacks the capacity to account for long-range relationships between tokens as described above.
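The sketch below tabulates these asymptotic counts for concrete values of $b$, $n$, and $c$; constants are dropped and hidden dimensions are assumed equal to input dimensions, so the numbers are order-of-magnitude estimates only.

```python
def footprint(b, n, c, k=3):
    """Rough parameter counts and intermediate-activation counts (in floats)
    for the three block types discussed above."""
    return {
        # Q, K, V, and output projections; the n x n attention map dominates
        # the activation memory.
        "self_attention":   {"params": 4 * c * c,         "acts": b * n * n + b * n * c},
        "token_mixing_mlp": {"params": 2 * n * n,         "acts": b * n * c},
        "conv_residual":    {"params": 2 * k * k * c * c, "acts": b * n * c},
    }

# e.g. batch size 16, 256 tokens (a 16x16 feature map), 512 channels
for name, counts in footprint(16, 256, 512).items():
    print(name, counts)
```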
We first compare results on the summer2winter Yosemite dataset using the original CycleGAN model and our proposed MixerGAN model. For fairness, we run both models on the same machine with the same settings. All models are trained with an input image size of 128, a batch size of 16, and the Adam optimizer with a learning rate of 0.003. We use the patch discriminator from the original CycleGAN work and train for 20,000 iterations. To ensure a fair comparison, we reproduce the CycleGAN code with the help of previous assignments and the official PyTorch repo. Both the ResNet CycleGAN and the MLP-Mixer CycleGAN use 9 translation blocks.
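For reference, the shared settings can be summarized as a single configuration; the Adam betas below are an assumption (CycleGAN's defaults), as the text does not specify them.

```python
# Shared training configuration for both the ResNet and MLP-Mixer variants.
config = dict(
    image_size=128,
    batch_size=16,
    optimizer="Adam",
    learning_rate=3e-3,
    adam_betas=(0.5, 0.999),   # assumption: CycleGAN's default betas
    iterations=20_000,
    translation_blocks=9,
    discriminator="PatchGAN",  # patch discriminator from the original CycleGAN
)
```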
In general, we observe that our MixerGAN slightly outperforms the original CycleGAN. In the first two images, the ResNet blocks impose an overly saturated blue on the images, whereas our model is perhaps more nuanced in its color. In the translation from winter to summer, our model also seems to pick up better brightness than the ResNet blocks. In the second-to-last row, ResNet renders some greenery on the water, whereas our model captures the water a bit better.
Results on the apple2orange dataset are generally good due to the simplicity of the dataset.
The horse2zebra dataset, however, is a bit more challenging for our network to translate properly due to the complexity of the images.
We have presented a new generative model based on MLP blocks. We argue that with a correct downsampling process and a proper layout of channels and tokens, the core ResNet block can be substituted with a more efficient and simpler MLP-based block. We have shown that qualitative results from our current model slightly outperform the original CNN-based CycleGAN on the summer2winter dataset. However, we acknowledge that the model is still a work in progress and that more tuning is necessary to avoid artifacts such as those we see on the horse2zebra dataset.
For next steps, we plan to replace the PatchGAN discriminator from the original CycleGAN with a discriminator based solely on MLPs, since MLP-Mixer has been shown to perform very well on discriminative tasks in the seminal work by Tolstikhin et al. We are also running ablation studies to empirically understand which parts of the original CNN-based CycleGAN model are necessary to ensure good performance.