Shen Zheng     Qiyu Chen
Carnegie Mellon University
{shenzhen,qiyuc}@andrew.cmu.edu
The task of unpaired image-to-image translation involves converting an image from one domain to another without a corresponding set of paired images for training. This area of research has gained significant attention in recent years due to its potential applications in various fields, such as style transfer [6], domain adaptation [11], image synthesis [2], and data augmentation [4]. Since this problem is ill-posed and can result in multiple possible translations, the primary challenge for unpaired image-to-image translation is to identify an appropriate assumption or constraint to regularize the translation process.
CycleGAN [14] is a well-known work that contributed to image-to-image translation through the introduction of a cycle-consistency loss. The follow-up work UNIT [10] expanded on this by proposing a shared latent space assumption, where images from different domains are mapped to the same latent vector. MUNIT [5] and DRIT [9] further improved the process by disentangling images into domain-invariant content codes and domain-specific style codes, allowing for the translation of images across multiple domains with style code variation.
Despite the significant progress made by existing methods on various benchmark datasets, two major challenges remain. The first is the presence of unwanted artifacts and distortions in the translated images, which results in poor perceptual similarity to both the source and target images. Even when the quality of the generated images is acceptable, they often exhibit the second challenge: the output images tend to resemble the source image more than the target image (i.e., source proximity). To illustrate these limitations of previous state-of-the-art methods, we provide visual examples in Fig. 2. In the next section, we propose our method, which aims to overcome these limitations.
This section begins by introducing the preliminaries and notations used in our proposed method. We then utilize the detailed information present in the output matrices of the discriminator [8, 1] to develop two novel loss functions: the Triangular Probability Similarity (TPS) loss and the Target Over Source (TOS) loss. These loss functions aim to address the limitations of previous methods, as discussed in the previous section. Finally, we present the complete set of loss functions used for training our model.
Let $W$ and $H$ denote the width and height of an image, respectively. Further, let $x$ denote the source image, $y$ the target image, $\hat{y}$ the translated image, and $D$ the discriminator.
In the manifold of the discriminator output matrix, image-to-image translation can be seen as transporting an image from the source domain to the target domain. The quality of the translated image $\hat{y}$ can be evaluated by the amount of feature overlap it shares with both the source and target images, as well as by the degree of unwanted artifacts it contains. As illustrated in Fig. 3, a "good" translation should have high feature overlap with the source and target images while minimizing the presence of artifacts, whereas a "bad" translation exhibits lower feature overlap and more pronounced artifacts.
We propose the Triangular Probability Similarity (TPS) loss to pull the translated image $\hat{y}$ towards the target domain along the line segment connecting the source image $x$ and the target image $y$ in the discriminator manifold. This constraint promotes greater feature overlap between them, which reduces unwanted distortions and artifacts. Using the triangle inequality, the TPS loss can be written as follows.
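One formulation consistent with this description, written with a distance $d(\cdot,\cdot)$ between discriminator output matrices (the choice of distance, taken here as $L_1$, is an assumption rather than the paper's exact definition), is

$$\mathcal{L}_{\mathrm{TPS}} = d\big(D(x), D(\hat{y})\big) + d\big(D(\hat{y}), D(y)\big) - d\big(D(x), D(y)\big),$$

which is non-negative by the triangle inequality and vanishes exactly when $D(\hat{y})$ lies on the line segment connecting $D(x)$ and $D(y)$.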
Although TPS reduces artifacts and distortions, it does not solve the source proximity issue. For instance, as seen in Fig. 4, two different generated images, $\hat{y}_1$ and $\hat{y}_2$, have the same TPS value because they lie on the same level set. However, the one closer to the target image is the better option, as it indicates stronger target proximity and weaker source proximity.
Drawing from the insight that TPS alone does not sufficiently address the source proximity issue, we introduce the Target Over Source (TOS) loss in order to further improve the translation quality. The TOS loss pulls the translated images towards the target images and pushes them away from the source images in the discriminator manifold, which reduces the source proximity and increases the target proximity. The TOS loss can be expressed mathematically as follows:
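A plausible instantiation consistent with the name and with the pull/push behaviour described above (the ratio form, and the reuse of the distance $d$ from the TPS sketch, are assumptions) is

$$\mathcal{L}_{\mathrm{TOS}} = \frac{d\big(D(\hat{y}), D(y)\big)}{d\big(D(\hat{y}), D(x)\big) + \epsilon},$$

where $\epsilon$ is a small constant for numerical stability; minimizing this ratio simultaneously decreases the distance to the target and increases the distance to the source in the discriminator manifold.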
The overall loss functions used for training are listed below:
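A plausible composition, assuming the base objective $\mathcal{L}_{\mathrm{base}}$ of the underlying framework (e.g., the adversarial and contrastive terms of MoNCE) and weighting hyperparameters $\lambda_{\mathrm{TPS}}$ and $\lambda_{\mathrm{TOS}}$ (both introduced here for illustration), is

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{base}} + \lambda_{\mathrm{TPS}}\,\mathcal{L}_{\mathrm{TPS}} + \lambda_{\mathrm{TOS}}\,\mathcal{L}_{\mathrm{TOS}}.$$

As a concrete sketch, the two penalty terms can be computed in PyTorch from the patch-wise discriminator outputs as below; the per-sample $L_1$ distance and the ratio form of TOS follow the assumed formulations above and are illustrative rather than the exact implementation.

```python
import torch

def _l1(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Per-sample L1 distance between discriminator output matrices of shape (B, C, H, W).
    return (a - b).abs().flatten(1).mean(dim=1)

def tps_loss(d_src, d_tgt, d_trans):
    # Triangle-inequality gap (assumed form): non-negative, zero when D(y_hat)
    # lies on the segment between D(x) and D(y) in the discriminator manifold.
    return (_l1(d_src, d_trans) + _l1(d_trans, d_tgt) - _l1(d_src, d_tgt)).mean()

def tos_loss(d_src, d_tgt, d_trans, eps: float = 1e-8):
    # Target-over-source distance ratio (assumed form): small when the translation
    # is close to the target and far from the source.
    return (_l1(d_trans, d_tgt) / (_l1(d_trans, d_src) + eps)).mean()
```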
Implementation Details Our model and the baseline models were trained on a single RTX 3090 GPU using PyTorch [7]. We used the Adam [7] optimizer with $\beta_1 = 0.5$ and $\beta_2 = 0.99$. Training ran for 400 epochs with a batch size of 1. A fixed learning rate of $2\times10^{-4}$ was used for the first 200 epochs, followed by a linear decay over the remaining 200 epochs that gradually reduces the learning rate to $2\times10^{-5}$. Image patches of size $256 \times 256$ were used for training. We evaluated the proposed method on several benchmark datasets, including label2photo on Cityscapes [3], horse2zebra, and cat2dog, and compared it against established methods such as CycleGAN [14], CUT [12], and MoNCE [13]. Performance was assessed with FID and visual comparisons of the generated images.
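For reference, the optimizer and learning-rate schedule described above can be set up in PyTorch roughly as follows; the generator and discriminator modules are placeholders, and the LambdaLR schedule mirrors the stated fixed-then-linear-decay policy (constant $2\times10^{-4}$ for 200 epochs, then linear decay to $2\times10^{-5}$ by epoch 400).

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# Placeholder networks; substitute the actual generator and discriminator.
generator = torch.nn.Linear(8, 8)
discriminator = torch.nn.Linear(8, 8)

base_lr, total_epochs, decay_start = 2e-4, 400, 200

opt_g = torch.optim.Adam(generator.parameters(), lr=base_lr, betas=(0.5, 0.99))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=base_lr, betas=(0.5, 0.99))

def lr_lambda(epoch: int) -> float:
    # Constant for the first 200 epochs, then linear decay from 2e-4 toward 2e-5.
    if epoch < decay_start:
        return 1.0
    progress = (epoch - decay_start) / (total_epochs - decay_start)
    return 1.0 - 0.9 * progress  # multiplier 1.0 -> 0.1

sched_g = LambdaLR(opt_g, lr_lambda)
sched_d = LambdaLR(opt_d, lr_lambda)

# Call sched_g.step() and sched_d.step() once per epoch after the training loop body.
```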
We perform an ablation study on the proposed TPS and TOS losses by integrating them as plug-and-play modules into the state-of-the-art method MoNCE [13]; that is, MoNCE is equivalent to our model with both proposed modules removed. We then evaluate the image-to-image translation performance qualitatively and quantitatively. As shown in Fig. 1, Ours (w/o TPS) exhibits peculiar artifacts around the upper part of the head and the region around the mouth, while Ours (w/o TOS) resembles a cat more than a dog (source proximity). In contrast, Ours with both proposed modules has fewer artifacts and distortions and is more similar to a dog (target proximity).
Table 1 presents a quantitative comparison of the proposed TPS and TOS losses with other state-of-the-art methods, using FID as the evaluation metric. The results show that our method consistently outperforms MoNCE in terms of FID. For the horse2zebra and cat2dog datasets, our full method achieves the lowest FID, while for the Cityscapes (label2photo) dataset, the variant with only TOS (without TPS) achieves the lowest FID.
Table 1. FID (lower is better) on horse2zebra, cat2dog, and Cityscapes label2photo.

| Methods | horse2zebra | cat2dog | label2photo |
|---|---|---|---|
| CycleGAN | 76.90 | 125.88 | 135.35 |
| CUT | 48.80 | 231.68 | 62.62 |
| MoNCE | 61.20 | 75.46 | 51.74 |
| Ours (w/o TOS) | 57.27 | 67.21 | 46.58 |
| Ours (w/o TPS) | 50.60 | 61.13 | 45.54 |
| Ours | 47.03 | 58.79 | 48.99 |
Cat2dog We provide a qualitative comparison on the cat2dog dataset in Fig. 5. The translated images generated by CycleGAN lack eyes, making them unrealistic. CUT's translated images exhibit strong source proximity, resembling the source (cat) more than the target (dog). MoNCE's translated images display strange distortions around the mouth. In comparison, our method produces high-quality images that match the semantic context of the target domain. In Fig. 6, we show another qualitative comparison on the cat2dog dataset: CycleGAN's and MoNCE's translated images contain peculiar artifacts and distortions around the dog's ears and eyes, which our method does not exhibit.
Horse2zebra Fig. 7 displays a qualitative comparison on the horse2zebra dataset. CycleGAN, CUT, and MoNCE generate blurry artifacts around the zebra body and in the background scenes, whereas our proposed method produces noticeably cleaner results.
Cityscapes (label2photo) Qualitative comparisons on the Cityscapes label2photo dataset are shown in Fig. 8 and Fig. 9. Our method produces high-quality images with good color balance, edge information, and image contrast, while exhibiting minimal artifacts and distortions, surpassing CycleGAN, CUT, and MoNCE in visual quality.
Although our method effectively reduces artifacts and distortions and suppresses source proximity, it is not immune to failure cases (as shown in Fig. 10). In the first example, the horses are correctly translated to zebras, but the background water is mistakenly translated to land. In the second example, the person is erroneously given zebra stripes, while the horse behind them remains unaltered. Our model fails to distinguish the regions that require translation from those that do not. To address this, we plan to apply TPS and TOS to multi-layer outputs of the discriminator instead of just the final layer, which should allow the method to capture higher levels of abstraction and better exploit the semantic context of specific regions in future work.
In this project, we employ the output matrices of the GAN discriminator and introduce two novel loss functions to improve image-to-image translation. Specifically, we utilize the Triangular Probability Similarity (TPS) loss and the Target Over Source (TOS) loss in the discriminator manifold. The TPS loss pulls the translated image towards the line segment connecting the source and target images, reducing artifacts and distortions. The TOS loss pulls the translated image proportionally towards the target image, reducing source proximity and encouraging target proximity. Our experiments demonstrate that the proposed method outperforms the previous state-of-the-art both qualitatively and quantitatively.