In this project, we design a novel position-aware image generator (PosGen), which combines a convolutional image generator with a convolution-free output module that synthesizes each pixel independently. We focus on semantic image synthesis, an image generation task that produces photorealistic RGB images from semantic segmentation maps. We then extend the task to higher output resolutions, enabled by our pixel-wise position-sensitive design. We evaluate PosGen on the COCO-Stuff [2] and Cityscapes [3] datasets, demonstrating both its effectiveness and several interesting observations about our model.
!pip install -q mediapy
import mediapy as media
import matplotlib.pyplot as plt
import glob
import numpy as np
import cv2
Our proposed PosGen generator combines a convolution-based module in the earlier layers to model low-level features with an implicit position-variant decoder in the later layers.
To be specific, our generator follows the designs of CIPS [1] and SPADE [4]. In the beginning, it keeps the original design of [4], which uses stacked SPADE residual blocks and performs well on the semantic-to-RGB generation task, while at later layers PosGen regresses from image pixel coordinates and their Fourier features to RGB color outputs. As a result, the final predicted images are generated in a pixel-wise conditionally independent fashion, meaning that each pixel is produced solely from its own positional and semantic features.
The following figure illustrates the structure of our generator. As depicted, the input semantic segmentation map first passes through several SPADE convolutional blocks. After that, we take the resulting semantic features, pass them through a Fourier feature layer, and inject position coordinate features (i.e., pixel locations normalized by the input spatial dimensions, e.g. $(-1, -1)$ represents the upper-left corner pixel). We then split these feature maps pixel by pixel and feed them into position-aware, conditionally independent fully connected (FC) layers to obtain the final RGB value of each pixel. Finally, we assemble all pixels into an RGB image as our synthesized prediction. The FC layers are shared across all pixels and computed in parallel.
image = media.read_image('figures/structure.png')
media.show_image(image, title='generator structure')
generator structure
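To make the position-variant decoder concrete, here is a minimal PyTorch sketch, not our exact training code, of how the normalized coordinate grid and its Fourier features could be built before being combined with the semantic features; the number of frequencies and the sin/cos encoding are assumptions for illustration.
import math
import torch

def coordinate_grid(h, w):
    # Normalized pixel coordinates in [-1, 1]; (-1, -1) is the upper-left pixel.
    ys = torch.linspace(-1.0, 1.0, h)
    xs = torch.linspace(-1.0, 1.0, w)
    yy, xx = torch.meshgrid(ys, xs, indexing='ij')
    return torch.stack([xx, yy], dim=0)                       # (2, h, w)

def fourier_features(coords, num_freqs=8):
    # Encode each coordinate with sin/cos at geometrically spaced frequencies
    # (num_freqs=8 is an assumed value, not necessarily what we trained with).
    freqs = (2.0 ** torch.arange(num_freqs).float()) * math.pi
    angles = coords.unsqueeze(0) * freqs.view(-1, 1, 1, 1)    # (num_freqs, 2, h, w)
    feats = torch.cat([angles.sin(), angles.cos()], dim=0)    # (2*num_freqs, 2, h, w)
    return feats.flatten(0, 1)                                # (4*num_freqs, h, w)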
Due to the pixel-wise independent design of the latter part of our generator, each pixel can be generated on its own and placed into the final output. We therefore devise a way to increase the resolution of the output RGB images without requiring high-resolution inputs. As shown in the structure below, since the output size is determined by the position coordinate feature map, it is natural to enlarge it into a high-resolution position feature map, which in turn increases the output dimensions. For the segmentation feature map, we simply interpolate it bilinearly to match the size of the enlarged coordinate feature map.
image = media.read_image('figures/highres_tructure.png')
media.show_image(image, title='generator structure for high resolution output')
generator structure for high resolution output
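Reusing the helpers sketched above, the high-resolution path can be illustrated as follows; the 512x512 target size matches our later experiments, while the plain channel-wise concatenation of upsampled semantic features and positional features is an assumption rather than our exact implementation.
import torch
import torch.nn.functional as F

def highres_inputs(sem_feats, target_h=512, target_w=512):
    # sem_feats: (N, C, h, w) semantic features from the SPADE convolutional blocks.
    coords = coordinate_grid(target_h, target_w)                    # (2, H, W)
    pos = fourier_features(coords)                                  # (4*num_freqs, H, W)
    pos = pos.unsqueeze(0).expand(sem_feats.size(0), -1, -1, -1)    # batch the positional features
    # Bilinearly upsample the semantic features to match the enlarged coordinate map.
    sem_up = F.interpolate(sem_feats, size=(target_h, target_w),
                           mode='bilinear', align_corners=False)
    return torch.cat([sem_up, pos], dim=1)   # per-pixel input to the shared FC layers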
For the discriminator, since it is not the focus of this project, we simply borrow the discriminator from SPADE [4] to accommodate our needs.
We implement our PosGen generator on top of the released SPADE [4] pipeline. For both training runs (COCO-Stuff [2] and Cityscapes [3]), we keep all hyperparameters the same as originally used in [4]. For the pixel-wise linear layers, we replace the last convolution of the original SPADE generator with 3 FC layers. Since the FC layers are shared across all pixels, the total number of parameters in the generator is actually reduced, to approximately 97.5M. We use the same generator across all trainings.
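Because a linear layer applied independently at every pixel is equivalent to a 1x1 convolution, the pixel-wise head can be sketched as below; the hidden width, LeakyReLU activations, and tanh output range are assumptions and may differ from our actual configuration.
import torch.nn as nn

def pixelwise_fc_head(in_channels, hidden=256):
    # Three FC layers shared across all pixels, expressed as 1x1 convolutions.
    return nn.Sequential(
        nn.Conv2d(in_channels, hidden, kernel_size=1),  # shared FC layer 1
        nn.LeakyReLU(0.2),
        nn.Conv2d(hidden, hidden, kernel_size=1),       # shared FC layer 2
        nn.LeakyReLU(0.2),
        nn.Conv2d(hidden, 3, kernel_size=1),            # shared FC layer 3 -> RGB
        nn.Tanh(),                                      # RGB values in [-1, 1]
    )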
We show results on the Cityscapes [3] dataset after 50 epochs of training. For each generated RGB image, we also show the associated semantic segmentation and groundtruth RGB image. First, we notice that all synthesized RGB images are largely photorealistic, with minor artifacts in some of them. This indicates that our positional and pixel-wise feature encodings are highly effective. We can also conclude that our model does not memorize the groundtruth images: the synthesized results sometimes differ from the groundtruth in object appearance (such as the color of cars, background buildings, or people's clothes) while remaining consistent with the corresponding segmentation inputs. This is a positive sign that our model generalizes well.
We do notice that some artifacts, such as the curvy lane markings on the road, are somewhat unrealistic. We argue that this may be because the input coordinates of adjacent pixels are similar, and their segmentation features are similar as well since they are positionally close to each other, yielding similar outputs. We may address this by further upgrading the feature inputs to the FC layers.
The visualization results are as follows.
def sub_img(img, row_idx, col_idx, height=256, width=512):
    # Crop the (row_idx, col_idx) tile out of a grid image of fixed-size tiles.
    return img[row_idx * height:(row_idx + 1) * height,
               col_idx * width:(col_idx + 1) * width]
city_seg = media.read_image('figures/epoch050_iter147104_input_label.png')
city_gt = media.read_image('figures/epoch050_iter147104_real_image.png')
city_pred = media.read_image('figures/epoch050_iter147104_synthesized_image.png')
# Each saved figure is a 3x3 grid of 256x512 tiles; show each triplet side by side.
for i in range(3):
    for j in range(3):
        images = {
            'segmentation input': sub_img(city_seg, i, j),
            'groundtruth RGB': sub_img(city_gt, i, j),
            'generated RGB': sub_img(city_pred, i, j),
        }
        media.show_images(images, columns=3, border=True)
segmentation input | groundtruth RGB | generated RGB |
Due to limited training time, we show results on the COCO-Stuff [2] dataset after 25 epochs of training. For each generated RGB image, we also show the associated semantic segmentation and groundtruth RGB image. Once again, our PosGen model is capable of generating complex photorealistic scenes and generalizes well rather than simply copying the groundtruth data.
The following results are produced only halfway through our training: the original plan is to train the model for 50 epochs, and we anticipate continued gains in generated detail and photorealism as training proceeds.
Here are those synthesized images.
coco_seg = media.read_image('figures/epoch025_iter117208_input_label.png')
coco_gt = media.read_image('figures/epoch025_iter117208_real_image.png')
coco_pred = media.read_image('figures/epoch025_iter117208_synthesized_image.png')
# Hand-picked tile coordinates (row, column) from the saved grid, listed in display order.
for i, j in [(1, 0), (1, 3), (2, 0), (2, 1), (2, 2), (3, 1), (8, 3), (9, 1), (9, 2)]:
    images = {
        'segmentation input': sub_img(coco_seg, i, j, height=256, width=256),
        'groundtruth RGB': sub_img(coco_gt, i, j, height=256, width=256),
        'generated RGB': sub_img(coco_pred, i, j, height=256, width=256),
    }
    media.show_images(images, columns=3, border=True)
segmentation input | groundtruth RGB | generated RGB |
As mentioned in 2.2, our PosGen can generate outputs at a higher resolution than the input semantic segmentation labels thanks to its pixel-level independence. We therefore selected some 256x256 segmentation inputs from the COCO-Stuff [2] dataset and produced 512x512 RGB outputs by providing a 512x512 coordinate grid and interpolating the semantic features by a factor of 2.
Results are shown below. Among them are several interesting ones: the model changes the letters on the STOP sign, swaps a vegan pizza for another one seemingly topped with pork sausages, and imagines several soccer players running on a baseball field. The results are amusing but very cool at the same time.
# Pair each 512x512 prediction with its segmentation input and groundtruth image by file name.
pred_list = sorted(glob.glob('figures/pred512/*.png'))
for pred in pred_list:
    filename = pred.split('/')[-1]
    seg_img = media.read_image('figures/seg2017/' + filename)
    gt_img = media.read_image('figures/val2017/' + filename[:-4] + '.jpg')
    # Resize the groundtruth to 256x256 so it stacks under the 256x256 segmentation map.
    gt_img = cv2.resize(gt_img, dsize=(256, 256), interpolation=cv2.INTER_CUBIC)
    images = {
        'segmentation input & groundtruth': np.vstack((seg_img, gt_img)),
        'high-res RGB synthesis': media.read_image(pred),
    }
    media.show_images(images, columns=2, border=True)
segmentation input & groundtruth | high-res RGB synthesis |
As illustrated in the sections above, our PosGen generator works well for semantic image synthesis and can generate high-resolution images thanks to its position-variant output layers.
To the best of our knowledge, we are the first to generate cross-domain images (i.e., from semantic label maps to photorealistic images) with an image generator whose output module is non-spatial by design. Unlike previous works that invested heavily in spatial feature representations to improve their GAN pipelines [5, 6, 7], our exploration is novel and experimental.
Due to the short timespan and limited computational resources for this project, we had to leave many desired experiments undone and many of our ideas untried. These include an exhaustive investigation of quantitative metrics (e.g., mIoU, FID) on our results, using only pixel-wise layers for the entire generator, etc.
[1] Ivan Anokhin, Kirill Demochkin, Taras Khakhulin, Gleb Sterkin, Victor Lempitsky, and Denis Korzhenkov. Image generators with conditionally-independent pixel synthesis, 2020.
[2] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[3] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[4] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially adaptive normalization. CoRR, abs/1903.07291, 2019.
[5] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks, 2016.
[6] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7354–7363. PMLR, 09–15 Jun 2019.
[7] Diederik P Kingma and Max Welling. Auto-encoding variational bayes, 2014.