16-726 21spring Final Project

Image Generation via Independent Semantic Synthesis

Author: Zhe Huang (zhehuang)

Acknowledgement: Andrew Luo (afluo)


1. Introduction

In this project, we design a novel position-aware image generator (PosGen), which combines a convolutional image generator design with a convolution-free output module that synthesizes each pixel independently. We focus on semantic image synthesis, an image generation task that produces photorealistic RGB images from semantic segmentation maps. We then extend the task to higher output resolutions, enabled by our pixel-wise, position-sensitive design. We evaluate PosGen on the COCO-Stuff [2] and Cityscapes [3] datasets, demonstrating its effectiveness and discussing several interesting observations about the model.

2. Methods

2.1. The overall design of PosGen generator

Our proposed PosGen generator combines a convolution-based module in the earlier layers, which models low-level features, with an implicit position-variant decoder in the later layers.

Specifically, our generator follows the designs of CIPS [1] and SPADE [4]. In the early layers, it retains the stacked SPADE residual blocks from [4], which perform well on semantic-to-RGB image generation; in the later layers, PosGen regresses the RGB output from image pixel coordinates and their Fourier features. As a result, the final predicted images are synthesized in a pixel-wise, conditionally independent manner: each pixel is produced solely from its own positional and semantic features.

The following figure illustrates the structure of our generator. As depicted, the input semantic segmentation map first goes through several SPADE convolutional blocks. The resulting semantic features are then transformed by a Fourier feature layer, and position coordinate features are injected (i.e., pixel locations normalized by the input spatial dimensions, so that $(-1, -1)$ corresponds to the upper-left corner pixel). We then split these feature maps pixel by pixel and feed each pixel's features into position-aware, conditionally independent FC layers to obtain its RGB output. Finally, all pixels are assembled into an RGB image as the final synthesized prediction. The FC layers are shared across all pixels and computed in parallel.
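Beyond the figure, the following is a minimal sketch of the pixel-wise output head under two assumptions: the Fourier features are computed from the normalized pixel coordinates (as in CIPS [1]) and concatenated with the SPADE features, and the shared FC layers are implemented as 1x1 convolutions, which apply the same linear map to every pixel in parallel. The names (PixelHead, feat_dim, num_freqs, hidden) are illustrative and not taken from our actual code.

```python
import math
import torch
import torch.nn as nn

class PixelHead(nn.Module):
    """Pixel-wise output module: shared FC layers applied independently to each pixel."""

    def __init__(self, feat_dim=64, num_freqs=4, hidden=64):
        super().__init__()
        in_dim = feat_dim + 4 * num_freqs  # SPADE features + sin/cos of (y, x) at num_freqs frequencies
        # A 1x1 convolution is the same FC layer applied to every pixel in parallel.
        self.mlp = nn.Sequential(
            nn.Conv2d(in_dim, hidden, kernel_size=1), nn.LeakyReLU(0.2),
            nn.Conv2d(hidden, hidden, kernel_size=1), nn.LeakyReLU(0.2),
            nn.Conv2d(hidden, 3, kernel_size=1), nn.Tanh(),
        )
        self.freqs = 2.0 ** torch.arange(num_freqs).float()

    def coord_features(self, h, w, device):
        # Pixel coordinates normalized to [-1, 1]; (-1, -1) is the upper-left corner pixel.
        ys = torch.linspace(-1.0, 1.0, h, device=device)
        xs = torch.linspace(-1.0, 1.0, w, device=device)
        grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=0)      # (2, h, w)
        angles = (grid.unsqueeze(0) * self.freqs.to(device).view(-1, 1, 1, 1)
                  * math.pi).reshape(-1, h, w)                                 # (2*num_freqs, h, w)
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=0)        # (4*num_freqs, h, w)

    def forward(self, spade_feats):
        b, _, h, w = spade_feats.shape
        pos = self.coord_features(h, w, spade_feats.device).unsqueeze(0).expand(b, -1, -1, -1)
        return self.mlp(torch.cat([spade_feats, pos], dim=1))  # (b, 3, h, w) RGB image
```

Because no spatial kernel follows the coordinate concatenation, the output resolution is determined entirely by the size of the coordinate grid and of the feature map fed into the head.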

2.2. Special design for generating high-res images

Due to the pixel-wise independent design of the latter part of our generator, each pixel can be generated independently and placed into the final output. We therefore devise a way to increase the resolution of our output RGB images without requiring high-resolution inputs. As shown in the structure below, since the output size is determined by the position coordinate feature map, it is natural to enlarge it into a high-resolution position feature map, which consequently increases the output dimensions as well. For the segmentation feature map, we simply interpolate it bilinearly to match the size of the enlarged coordinate feature map.
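A minimal sketch of this procedure, reusing the hypothetical PixelHead above: the coordinate grid is built at the target resolution, and the semantic features are bilinearly upsampled to match it.

```python
import torch
import torch.nn.functional as F

def render_highres(head, spade_feats, scale=2):
    """Render at `scale`x the feature-map resolution by enlarging only the inputs to the head."""
    b, _, h, w = spade_feats.shape
    hi_h, hi_w = h * scale, w * scale
    # Bilinearly interpolate the semantic features to the enlarged grid size.
    feats_up = F.interpolate(spade_feats, size=(hi_h, hi_w),
                             mode="bilinear", align_corners=False)
    # Coordinate features are simply generated at the higher resolution.
    pos = head.coord_features(hi_h, hi_w, spade_feats.device).unsqueeze(0).expand(b, -1, -1, -1)
    return head.mlp(torch.cat([feats_up, pos], dim=1))  # (b, 3, h*scale, w*scale)
```

With scale=2, a 256x256 feature map yields a 512x512 output, which matches the setting used in Section 3.4.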

2.3. About the discriminator

Since the discriminator is not the focus of this project, we simply reuse the discriminator from SPADE [4].

3. Results

3.1. Implementation details

We implement our PosGen generator based on the released SPADE [4] pipeline. For both trainings (COCO-Stuff [2] and Cityscapes [3]), we keep all hyperparameters the same as those originally used in [4]. For the pixel-wise linear layers, we replace the last convolution of the original SPADE generator with 3 FC layers. Since the FC layers are shared across all pixels, the total number of parameters in the generator is actually reduced, to approximately 97.5M. We use the same generator architecture for all trainings.
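For reference, the parameter count can be checked with a short helper; the generator variable name below is hypothetical, and the ~97.5M figure above is the number we report rather than the output of this snippet.

```python
def count_params(module):
    # Count the trainable parameters of a torch.nn.Module.
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# e.g. print(f"Generator parameters: {count_params(generator) / 1e6:.1f}M")
```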

3.2. Training & testing on the Cityscapes dataset

We show the results on the Cityscapes [3] dataset after 50 epochs of training. For each generated RGB image, we also show the associated semantic segmentation and ground-truth RGB image. First, we notice that all synthesized RGB images are largely photorealistic, with minor artifacts in some of them. This indicates that our positional and pixel-wise feature encodings are highly effective. We can also conclude that our model does not memorize the ground-truth images: the synthesized results sometimes differ from the ground-truth RGB images in object appearance (such as the color of cars, background buildings, and people's clothes) while remaining consistent with their segmentation inputs. This is a positive sign that our model generalizes well.

We do notice that some artifacts, such as the curvy lane markings on the road, are somewhat unrealistic. We argue that this may be because the input coordinates of adjacent pixels are similar, and their segmentation features are also similar since they are positionally close to each other, yielding similar outputs. This might be addressed by further enriching the feature inputs to the FC layers.

The visualization results are as follows.

3.3. Training & testing on the COCO-Stuff dataset

Due to limited training time, we show the results on the COCO-Stuff [2] dataset after 25 epochs of training. For each generated RGB image, we also show the associated semantic segmentation and ground-truth RGB image. Once again, our PosGen model is capable of generating complex photorealistic scenes and generalizes well rather than simply copying the ground-truth data.

The following results are produced only halfway through training. The original plan was to train the model for 50 epochs, and we anticipate continued gains in generated detail and photorealism as training progresses.

Here are those synthesized images.

3.4. Synthesizing high-res images

As mentioned in Section 2.2, our PosGen is capable of generating outputs at a higher resolution than the input semantic segmentation labels thanks to its pixel-level independence. Hence, we select some 256x256 segmentation inputs from the COCO-Stuff [2] dataset and produce synthesized RGB outputs by providing coordinates corresponding to 512x512 and interpolating the semantic features by a factor of 2.

Results are shown below, and several of them are particularly interesting. For example, the model changes the words on a STOP sign, swaps a vegan pizza for one likely topped with pork sausages, and imagines several soccer players running on a baseball field. The results are amusing but very cool at the same time.

4. Discussion

As illustrated in the sections above, our PosGen generator works well for semantic image synthesis and can generate high-resolution images thanks to its position-variant output layers.

To the best of our knowledge, we are the first to generate cross-domain images (i.e., from semantic maps to photorealistic images) using an image generator with a non-spatial output design. Unlike previous works that invest heavily in spatial feature representations to improve their GAN pipelines [5, 6, 7], our exploration is novel and experimental.

Due to the short timespan and limited computational resources of this project, many desired experiments were left undone and many ideas untried. These include a thorough investigation of quantitative metrics (e.g., mIoU, FID) on our results, using only pixel-wise layers for the entire generator, and more.


References

[1] Ivan Anokhin, Kirill Demochkin, Taras Khakhulin, Gleb Sterkin, Victor Lempitsky, and Denis Korzhenkov. Image generators with conditionally-independent pixel synthesis, 2020.

[2] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[3] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[4] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially adaptive normalization. CoRR, abs/1903.07291, 2019.

[5] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks, 2016.

[6] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7354–7363. PMLR, 09–15 Jun 2019.

[7] Diederik P Kingma and Max Welling. Auto-encoding variational bayes, 2014.