**Final project for CMU 16-726**
**Improved Synthesis of Human Body and Cloth Texture with Generative Adversarial Network**
*Name: Soyong Shin (soyongs) and Juyong Kim (juyongk)*
(##) Contents
* Part 1: Introduction
* Part 2: Method
* Part 3: Results
* Part 4: Discussion
(##) Part 1: Introduction
![figure [2D_models]: **2D Pose Estimation Models.** a) 2D keypoint estimation result from [#Alphapose], b) 2D mask segmentation result from [#Graphonomy]](figures/FigureX_2D_Models.png)
Estimating the pose of humans in images has long been considered a major task in the computer vision community due to its wide range of applications in computer animation, AR/VR, and medical settings.
Following the breakthrough of deep learning, various neural network architectures have been developed to predict 2D keypoint (i.e., human body joint) locations [#Openpose, #Alphapose] or semantic segmentation masks [#Graphonomy].
However, the 2D information extracted by those models does not fully transfer to the real 3D world, since an infinite number of 3D human pose candidates can explain a given 2D prediction.
![figure [3d_data]: **3D Datasets.** a) [#Human3.6M] dataset, b) [#TotalCapture] dataset, c) [#MPI-INF-3DHP] dataset, d) [#PanopticStudio] dataset](figures/FigureX_3D_Data.png)
The major challenge in extracting 3D information from a single-view image is the lack of datasets with 3D labels.
Unlike 2D labels, 3D ground truth cannot be generated by human annotation; it requires marker-based [#Human3.6M, #TotalCapture] or multi-view vision-based [#MPI-INF-3DHP, #PanopticStudio] motion capture systems with many calibrated cameras. Therefore, publicly available 3D-labeled datasets have much less variability in subjects and background scenery than 2D datasets [#MS-COCO, #MPII]. Furthermore, those systems only provide 3D keypoints of the human body. As recent computer vision architectures aim to recover dense predictions of human pose, such as 3D vertices, dense correspondences, or 3D volume occupancy (Figure X), there is a continuing demand for densely labeled 3D datasets.
![figure [dense_predictions]: **Recent Models Predicting Dense Human Pose.** a) 3D human mesh regression from SPIN, b) 2D human dense correspondence estimation from DensePose, c) 3D human volume occupancy estimation from BodyNet.](figures/FigureX_DensePredictions.png)
One alternative that overcomes the limitations of the aforementioned systems is to generate synthetic image data by projecting a 3D human mesh onto an image plane. A conventional synthetic-image method, the SURREAL dataset, first generates a synthetic human via a 3D body model and overlays it onto background images. The method uses 3D scan data to texture the mesh. However, it does not consider the correspondence between the virtual human and the background, and therefore lacks realism.
![figure [surreal]: **SURREAL Dataset.** Conventional synthetic human image dataset.](figures/FigureX_Surreal.png)
In this paper, we propose a new framework for generating synthetic human image datasets with improved synthesis of human body and cloth texture. Similar to the SURREAL dataset, we leverage a 3D human body model to create a virtual human mesh. However, rather than merely overlaying the mesh projections onto background images, we add textures of the human body, cloth, and background using recently proposed generative adversarial network (GAN) architectures. We expect this framework to produce more realistic images.
(##) Part 2: Method
**2.1. Overall Pipeline**
![figure [pipeline]: **Overall Pipeline.**](figures/FigureX_Pipeline.png)
Figure X shows the overall pipeline of our framework. To generate synthetic data with known 3D labels, we first created a virtual 3D human body via a statistical body model.
We randomly sampled the pose, shape, and outfit of each virtual human to increase the variability of the synthetic data.
We then created a 2D mask and image of the virtual human by projecting it onto a defined image plane.
Since we sampled the global orientation of the virtual human, we did not sample the camera; instead, we fixed the focal length at 5,000 *mm* and the image size at $480 \times 480$.
Then, using the generator network $\mathcal{G}$, we created a synthetic image by adding texture to the projected image and mask.
The generator is trained adversarially against a discriminator $\mathcal{D}$ using a real human image dataset, *MPII*.
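As a concrete illustration of the projection step, the sketch below projects mesh vertices onto the $480 \times 480$ image plane with a pinhole camera; it assumes the vertices are already expressed in camera coordinates with the principal point at the image center, and the random vertices are placeholders rather than our actual meshes.

```python
import numpy as np

def project_vertices(verts_cam, focal=5000.0, img_size=480):
    """Pinhole projection of camera-frame vertices (N, 3) to pixel coordinates (N, 2).

    A minimal sketch: assumes positive depth (z > 0) and a centered principal point;
    it is not the exact rendering code used in our pipeline.
    """
    cx = cy = img_size / 2.0
    x, y, z = verts_cam[:, 0], verts_cam[:, 1], verts_cam[:, 2]
    u = focal * x / z + cx
    v = focal * y / z + cy
    return np.stack([u, v], axis=1)

# Example: a placeholder "body" placed ~40 units in front of the camera.
verts_cam = np.random.randn(6890, 3) * 0.3 + np.array([0.0, 0.0, 40.0])
uv = project_vertices(verts_cam)               # (6890, 2) pixel coordinates
inside = ((uv >= 0) & (uv < 480)).all(axis=1)  # vertices landing inside the image
```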
**2.2. Generating Virtual Human**
*2.2.1. Human body model*
![figure [smpl_model]: **SMPL model.** (a) template vertices given by the model, (b) adding shape variability by shape parameter, (c) adding shape variability by pose parameter, (d) final output with posed body](figures/FigureX_SMPL.png)
To generate virtual humans with known dense 3D information, we leveraged the parametric 3D human body model [#SMPL].
SMPL is a differentiable function, $\mathcal{M}_{smpl}(\theta, \beta)$, that maps pose parameters $\theta \in R^{J\times 3}$ and shape parameters $\beta \in R^{10}$ to a triangulated body mesh $M_{smpl} \in R^{N\times3}$.
The pose parameters are $J = 24$ 3D rotation vectors: the rotation of each body segment relative to its parent segment, plus the global orientation of the body (i.e., the root joint rotation).
The shape parameter vector represents the 10 directions of greatest shape variability, obtained via principal component analysis.
The reconstructed 3D human body mesh consists of $N=6890$ vertices, from which the 3D joint-center locations $X_{3D} \in R^{J\times 3}$ can be linearly regressed with a regression matrix $W$ as $X_{3D} = WM_{smpl}$.
Using the SMPL model, we can generate realistic 3D human body meshes with corresponding 3D vertex/keypoint locations and joint angles.
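A minimal sketch of this step, assuming the `smplx` Python package and locally available SMPL model files; the final lines mirror the joint regression $X_{3D} = W M_{smpl}$ above.

```python
import torch
import smplx  # assumption: the smplx package and SMPL model files are installed locally

# Build the SMPL layer (N = 6890 vertices, J = 24 joints).
smpl = smplx.create(model_path='models', model_type='smpl', gender='neutral')

# Randomly sample shape and pose; global_orient is the root rotation and
# body_pose holds the remaining 23 per-joint axis-angle vectors.
betas = torch.randn(1, 10) * 0.5
global_orient = torch.randn(1, 3) * 0.2
body_pose = torch.randn(1, 69) * 0.2

output = smpl(betas=betas, global_orient=global_orient, body_pose=body_pose)
vertices = output.vertices[0]          # (6890, 3) mesh M_smpl

# Linear joint regression X_3D = W @ M_smpl using the regressor shipped with the model.
W = smpl.J_regressor                   # (24, 6890)
joints_3d = W @ vertices               # (24, 3) joint-center locations
```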
*2.2.2. Human cloth model*
![figure [cape_model]: **CAPE model.** (a) SMPL output without dressing, (b) adding displacement from CAPE model](figures/FigureX_CAPE.png)
We further employed a human clothing model, [#CAPE], to dress the generated SMPL body so that the projected mask looks more realistic.
The CAPE model takes a cloth type and a latent vector as inputs.
The cloth type $c \in \mathcal{I}^4$, where $\mathcal{I} = \{0, 1\}$, is a one-hot vector that selects the outfit from four options, combinations of short/long and upper/lower types.
The latent vector $z \in R^{512}$ is sampled from a normal distribution and decoded by the CAPE model $\mathcal{M}_{cape}$.
The CAPE model $\mathcal{M}_{cape}$ then generates a displacement $\delta(c, z) \in R^{6890 \times 3}$ for the given SMPL mesh $\mathcal{M}_{smpl}(\theta, \beta)$ and creates new vertices $M_{cape} = M_{smpl} + \delta$.
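The dressing step can be sketched as follows; here `cape_decoder` is a hypothetical stand-in for the pretrained CAPE decoder $\mathcal{M}_{cape}$ (the actual CAPE codebase exposes its own interface), so only the data flow is illustrative.

```python
import torch

def dress_smpl(vertices_smpl, cape_decoder, cloth_idx=0):
    """Add CAPE clothing displacements to an SMPL mesh.

    vertices_smpl: (6890, 3) tensor from the SMPL model.
    cape_decoder:  hypothetical callable standing in for the pretrained CAPE
                   decoder; maps (c, z) to a per-vertex displacement field.
    """
    # One-hot cloth type c among the four short/long x upper/lower combinations.
    c = torch.zeros(4)
    c[cloth_idx] = 1.0

    # Latent code z ~ N(0, I) controlling the clothing geometry.
    z = torch.randn(512)

    delta = cape_decoder(c, z)          # (6890, 3) displacement delta(c, z)
    return vertices_smpl + delta        # M_cape = M_smpl + delta
```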
**2.3. Texture synthesis**
![Figure [architecture]: Overall model architecture.](figures/FigureX_Architecture.png width=550px)
To put realistic texture on the virtual human body rendering and the background, we apply GAN models whose generator takes the human rendering image and outputs an image of the same size.
Figure [architecture] shows the overall architecture of our texture synthetic model.
The task our model performs is essentially image-to-image translation, where the generator additionally takes a latent vector as input to represent the texture applied to the input image.
The generators of BicycleGAN [#BicycleGAN] and SPADE [#SPADE] satisfy this input requirement, and we try both for our generator $\mathcal{G}$.
The BicycleGAN generator has a U-Net architecture and copies the latent vector to all pixels uniformly (see Figure [bicyclegan]).
We use the `add_to_input` configuration for injecting the latent noise, where the noise is fed into the network only at the input layer.
On the other hand, the SPADE generator consists only of upsampling and convolutional layers that grow the latent vector to image size, and the input image (label map) is injected into the network at multiple scales (see Figure [spade]).
For the discriminator $\mathcal{D}$, we try both a standard discriminator and a patch discriminator.
![Figure [bicyclegan]: `add_to_input` configuration of the BicycleGAN model](figures/FigureX_BicycleGAN.png width=270px) ![Figure [spade]: The SPADE model](figures/FigureX_Spade.png width=380px)
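The `add_to_input` injection can be sketched as below: the latent vector is replicated over the spatial dimensions and concatenated with the rendering before the first convolution of the U-Net (a simplified sketch rather than the exact BicycleGAN implementation; shapes and the latent dimension are illustrative).

```python
import torch

def add_latent_to_input(rendering, z):
    """Tile the latent vector over all pixels and concatenate it with the input.

    rendering: (B, 3, H, W) rendered human image.
    z:         (B, Z) latent texture code.
    Returns a (B, 3 + Z, H, W) tensor that is fed to the U-Net generator.
    """
    B, _, H, W = rendering.shape
    z_map = z.view(B, -1, 1, 1).expand(B, z.shape[1], H, W)
    return torch.cat([rendering, z_map], dim=1)

x = torch.randn(2, 3, 256, 256)        # batch of renderings
z = torch.randn(2, 8)                  # latent codes (dimension is illustrative)
g_input = add_latent_to_input(x, z)    # (2, 11, 256, 256)
```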
The objective function of our method consists of three losses.
First, the generator $\mathcal{G}$ and the discriminator $\mathcal{D}$ are trained with a standard GAN loss, applied to both networks.
For the GAN objective, we use the LSGAN formulation, which is known to speed up convergence.
In addition to the GAN loss, we utilize a pretrained keypoint detector and a pretrained segmentation model to supervise the generator so that it places the human at the same location and pose as the input rendering.
From the generator output, we extract the 2D joint locations with the SPPE model (**SPPE citation**) and compute an $L_1$ loss between the detected joint locations and the ground-truth joint locations.
Similarly, the Graphonomy model [#Graphonomy] computes the human segmentation, which induces an $L_1$ loss against the ground-truth segmentation from the rendering.
The final loss of our method is as follows:
$$
\begin{aligned}
\mathcal{L}_\mathcal{G}(x, z) &= \mathbb{E}_{x,z} \Big[(\mathcal{D}(\mathcal{G}(x, z)) - 1)^2 + \lambda_\text{key} \| f_\text{SPPE}(\mathcal{G}(x, z)) - x_\text{key} \|_1 \\
&\qquad\quad + \lambda_\text{seg} \| f_\text{Graphonomy}(\mathcal{G}(x, z)) - x_\text{seg} \|_1 \Big], \\
\mathcal{L}_\mathcal{D}(x, z, y) &= \mathbb{E}_{x, z}\big[\mathcal{D}(\mathcal{G}(x, z))^2\big] + \mathbb{E}_y\big[(\mathcal{D}(y) - 1)^2\big],
\end{aligned}
$$
where $x$, $z$, and $y$ are the input rendering, the latent vector, and the real human image, respectively, and $x_\text{seg}$ and $x_\text{key}$ are the ground-truth segmentation and joint keypoints, respectively.
We use $\lambda_\text{key} = \lambda_\text{seg} = 1$ unless specified otherwise.
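A minimal PyTorch sketch of these objectives under the LSGAN formulation; `keypoint_net` and `seg_net` are stand-ins for the frozen SPPE and Graphonomy models, and all names and tensor shapes are illustrative rather than our exact training code.

```python
import torch
import torch.nn.functional as F

def generator_loss(G, D, keypoint_net, seg_net, x, z, x_key, x_seg,
                   lam_key=1.0, lam_seg=1.0):
    """LSGAN generator loss plus keypoint and segmentation L1 terms."""
    fake = G(x, z)
    loss_gan = ((D(fake) - 1.0) ** 2).mean()           # fool the discriminator
    loss_key = F.l1_loss(keypoint_net(fake), x_key)    # match ground-truth 2D joints
    loss_seg = F.l1_loss(seg_net(fake), x_seg)         # match ground-truth segmentation
    return loss_gan + lam_key * loss_key + lam_seg * loss_seg

def discriminator_loss(G, D, x, z, y):
    """LSGAN discriminator loss: fake -> 0, real -> 1."""
    fake = G(x, z).detach()                            # no gradient into the generator
    return (D(fake) ** 2).mean() + ((D(y) - 1.0) ** 2).mean()
```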
(##) Part 3: Results
In this section, we present the experimental results of our texture synthesis.
For real human images, we use the MPII Human Pose Database [#MPII], which provides 25k images of more than 40k people with bounding box annotations.
We cropped the images using the human bounding boxes and resized them to $256 \times 256$.
Figure [real-image-1] and Figure [real-image-2] show random examples of real human images used in the training.
To create the synthetic human images, we sampled the human pose from a pose prior and sampled the cloth latent vector randomly.
The synthetic human mesh is generated with the CAPE model, and the rendered image is fit to $256 \times 256$ using preset camera intrinsics.
Figure [synthetic-image-1] and Figure [synthetic-image-2] show random synthetic human renderings, which are fed into the generator.
![figure [real-image-1]: real image](figures/bicyclegan/epoch_00181_data_B.png) ![Figure [real-image-2]: Real image](figures/bicyclegan/epoch_00182_data_B.png) ![Figure [synthetic-image-1]: Synthetic human image](figures/bicyclegan/epoch_00181_data_A.png) ![Figure [synthetic-image-2]: Synthetic human image](figures/bicyclegan/epoch_00182_data_A.png)
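The real-image preprocessing described above (bounding-box crop followed by a resize to $256 \times 256$) can be sketched as follows, assuming PIL and bounding boxes given in pixel coordinates; MPII annotation parsing is omitted.

```python
from PIL import Image

def crop_person(image_path, bbox, out_size=256):
    """Crop a person by their bounding box and resize to out_size x out_size.

    bbox: (left, top, right, bottom) in pixel coordinates, clamped to the image.
    """
    img = Image.open(image_path).convert('RGB')
    left, top, right, bottom = bbox
    left, top = max(0, left), max(0, top)
    right, bottom = min(img.width, right), min(img.height, bottom)
    crop = img.crop((left, top, right, bottom))
    return crop.resize((out_size, out_size), Image.BILINEAR)
```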
We trained each model with a batch size of 50 on a Quadro RTX 8000 48GB GPU until training fully converged and stabilized.
None of the configurations synthesized realistic human texture or created a realistic background.
Instead, we compare the qualitative results across configurations to guide future research.
**3.1. BicycleGAN experiments**
We train the model with the BicycleGAN generator in two cases: 1) only the GAN loss is imposed (i.e., $\lambda_\text{key} = \lambda_\text{seg} = 0$), and 2) all losses are used.
When only the GAN loss is used with the BicycleGAN generator, the model collapses to producing the same image regardless of the input rendering.
This is expected, since the model has no supervision tying the output to the rendering input.
Even with the other losses, the output images contain only a vague silhouette of a human, and the texture appears similar across outputs.
![figure [bicycle-ganonly-1]: BicycleGAN, GAN loss](figures/bicyclegan_ganonly/epoch_00199_output.png) ![figure [bicycle-ganonly-2]: BicycleGAN, GAN loss](figures/bicyclegan_ganonly/epoch_00200_output.png) ![figure [bicycle-allloss-1]: BicycleGAN, All losses](figures/bicyclegan/epoch_00184_output.png) ![figure [bicycle-allloss-2]: BicycleGAN, All losses](figures/bicyclegan/epoch_00185_output.png)
**3.2. SPADE experiments**
When we use the SPADE generator, the output contains the silhouette of the input rendering, because the rendering is fed into the generator at multiple stages.
However, the overall texture of the output images, including the background, still tends to collapse to a single mode.
![figure [spade-ganonly-1]: SPADE, GAN loss](figures/spade_ganonly/epoch_00399_output.png) ![figure [spade-ganonly-2]: SPADE, GAN loss](figures/spade_ganonly/epoch_00400_output.png) ![figure [spade-allloss-1]: SPADE, All losses](figures/spade/epoch_00199_output.png) ![figure [spade-allloss-2]: SPADE, All losses](figures/spade/epoch_00200_output.png)
**3.3. SPADE + Global loss experiments**
In addition to the patch discriminator loss used above, we added a single scalar output to the discriminator that predicts whether the whole input image is real or fake.
We expected this to mitigate the collapsed-texture problem by encouraging more diverse global texture configurations, since repeated textures would otherwise be easy for the discriminator to flag as fake.
To isolate the effect of the global GAN loss, we removed the other two losses during training.
As shown in Figure [spade-global-1] through Figure [spade-global-4], the outputs show more diverse textures.
However, the outputs do not follow the human rendering input; we conjecture that this is because we did not include the segmentation and keypoint losses that would align the output with the input.
![figure [spade-global-1]: SPADE, Patch+Global GAN](figures/spade_ganonly_global/epoch_00348_output.png width=168px) ![figure [spade-global-2]: SPADE, Patch+Global GAN](figures/spade_ganonly_global/epoch_00349_output.png width=168px) ![figure [spade-global-3]: SPADE, Patch+Global GAN](figures/spade_ganonly_global/epoch_00350_output.png width=168px) ![figure [spade-global-4]: SPADE, Patch+Global GAN](figures/spade_ganonly_global/epoch_00347_output.png width=168px)
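A minimal sketch of this discriminator modification, assuming a PatchGAN-style backbone; the layer sizes and the pooling-based global head are our own assumptions for illustration, not the exact implementation used in the experiments.

```python
import torch
import torch.nn as nn

class PatchPlusGlobalD(nn.Module):
    """Patch discriminator with an additional whole-image real/fake scalar."""
    def __init__(self, in_ch=3, nf=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, nf, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(nf, nf * 2, 4, 2, 1), nn.InstanceNorm2d(nf * 2), nn.LeakyReLU(0.2, True),
            nn.Conv2d(nf * 2, nf * 4, 4, 2, 1), nn.InstanceNorm2d(nf * 4), nn.LeakyReLU(0.2, True),
        )
        self.patch_head = nn.Conv2d(nf * 4, 1, 4, 1, 1)   # per-patch real/fake scores
        self.global_head = nn.Linear(nf * 4, 1)           # single whole-image score

    def forward(self, x):
        h = self.features(x)
        patch_out = self.patch_head(h)         # (B, 1, H', W') patch score map
        pooled = h.mean(dim=(2, 3))            # global average pooling -> (B, nf*4)
        global_out = self.global_head(pooled)  # (B, 1) whole-image prediction
        return patch_out, global_out

d = PatchPlusGlobalD()
patch_scores, global_score = d(torch.randn(2, 3, 256, 256))
```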
(##) Part 4: Discussion
**Limitations**
*Poor quality*
Overall, we failed to synthesize synthetic humans and backgrounds of realistic quality. To achieve the results we desire, we need to reduce the problem to an easier one and analyze the results.
*Unreliability of 3D label*
Our framework generates 2D projection images and masks of virtual humans with known 3D labels, including vertex and keypoint locations. However, our framework enforces only 2D constraints, through the segmentation and keypoint losses. Therefore, it is not fully guaranteed that a given synthetic image exactly matches the 3D labels we provide.
(#) Bibliography
[#Alphapose]: H. Fang, S. Xie, Y. Tai and C. Lu. "RMPE: Regional Multi-person Pose Estimation.", In *Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV)*, 2017, https://arxiv.org/abs/1612.00137
[#Graphonomy]: K. Gong, Y. Gao, X. Liang, X. Shen, M. Wang and L. Lin. "Graphonomy: Universal Human Parsing via Graph Transfer Learning", In *Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 2019, https://arxiv.org/abs/1904.04536
[#Openpose]: Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei and Y. Sheikh, "OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields", *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 2019, https://arxiv.org/abs/1812.08008
[#SMPL]: M. Loper, N. Mahmood, J. Romero, G. Pons-Moll and M. Black. "SMPL: A Skinned Multi-Person Linear Model", In *Proceedings of SIGGRAPH ASIA*, 2015
[#CAPE]: Q. Ma, J. Yang, A. Ranjan, S. Pujades, G. Pons-Moll, S. Tang, and M. Black, "Learning to Dress 3D People in Generative Clothing", In *Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 2020, https://arxiv.org/pdf/1907.13615.pdf
[#Human3.6M]: C. Ionescu, D. Papava, V. Olaru and C. Sminchisescu, "Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments", *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 2014
[#TotalCapture]: M. Trumble, A. Gilbert, C. Malleson, A. Hilton and J. Collomosse, "Total Capture: 3D Human Pose Estimation Fusing Video and Inertial Sensors", In *Proceedings of the British Machine Vision Conference (BMVC)*, 2017
[#MPI-INF-3DHP]: D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W. Xu and C. Theobalt, "Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision", In *Proceedings of International Conference on 3D Vision (3DV)*, 2017
[#PanopticStudio]: H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara and Y. Sheikh, "Panoptic Studio: A Massively Multiview System for Social Motion Capture", In *Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV)*, 2015
[#MS-COCO]: T. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C.L. Zitnick and P. Dollár, "Microsoft COCO: Common Objects in Context", *arXiv*, 2014, https://arxiv.org/abs/1405.0312
[#MPII]: M. Andriluka, L. Pishchulin, P. Gehler and B. Schiele, "2D Human Pose Estimation: New Benchmark and State of the Art Analysis", In *Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 2014
[#BicycleGAN]: J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang and E. Shechtman, "Toward Multimodal Image-to-Image Translation", In *Advances in Neural Information Processing Systems (NIPS)*, 2017
[#SPADE]: T. Park, M.-Y. Liu, T.-C. Wang and J.-Y. Zhu, "Semantic Image Synthesis with Spatially-Adaptive Normalization", In *Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 2019